RCT #175 - Digital Humanities’ LLM Edge? Plus: EOLing a Product; Product vs Project; Directing Contributors Where Needed; Commercial Products and Reproducibility; Federated Analysis Primer; And the Job Board Returns, Maybe?
Happy 2024!
RCT took a bit of an unplanned long holiday break, due to life intervening; all's well now, and I'm looking forward to the year ahead. The next few issues might be a bit short as I get back into the swing of things, and there'll be some modest changes (I'll be slowly bringing back the jobs board, which I'm excited about!).
As always, please email me (hit reply, or email jonathan@dursi.ca) if you have any suggestions, thoughts, or comments. I have a bit of a backlog of emails to reply to, but I read everything and will reply! Or you can always set up a 15-minute chat with me to give me your comments.
Here’s something I’d like to hear from the community about, by email or chat.
This is anecdotal so far, but I’m starting to see a pattern of anecdotes that suggests to me that teams with significant digital humanities skills are getting researchers up to speed faster in using existing LLM tooling for research. Are you seeing this, too?
So when I think about LLMs specifically in research, I’m thinking of three broad categories:
Research in AI or LLM methods
Can be super inside-baseball stuff like quantizations, architectures, alternatives to attention…
Often but not necessarily driven by novel application areas or data types
Research about AI or LLMs
Ethics of data collection, philosophy, how LLMs will affect jobs
Research with AI or LLMs
Using existing methods and services to accelerate current research projects or make new projects feasible
Analyzing large amounts of text, videos, images
Writing, summarizing
Speeding code development…
Several times now, I've seen research computing, research software, or research data groups with digital humanities expertise doing a better, faster job of onboarding researchers to productive uses of LLMs for their research (things like feeding papers or job postings into LLMs for efficient and rapid analysis).
In particular, I'm thinking about work like this paper, "Measuring remote work using a Large Language Model (LLM)", which was written up in HBR both for its results (significant inequality in who gets to work remotely - it's disproportionately higher-paid jobs with higher educational requirements) and its methodology. Basically, the work used LLMs and humans to categorize the jobs described in 10,000 job ads, and cross-checked the two sets of labels against each other.
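For concreteness, here's roughly what that general pattern looks like in code. This is a minimal sketch of my own, not the authors' actual pipeline; it assumes the OpenAI Python client, and the model name, categories, and prompt are all illustrative placeholders:

```python
# A minimal sketch (not the paper's pipeline) of the general technique:
# ask an LLM to assign each job ad to a category, then cross-check a
# sample against human labels. Model name, categories, and prompt are
# placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["fully remote", "hybrid", "on-site", "unclear"]

def classify_ad(ad_text: str) -> str:
    """Ask the model to pick exactly one category for a job ad."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any capable chat model
        temperature=0,  # keep labels as deterministic as possible
        messages=[
            {"role": "system",
             "content": "Classify the job ad's remote-work policy. "
                        f"Answer with exactly one of: {', '.join(CATEGORIES)}."},
            {"role": "user", "content": ad_text},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "unclear"

def agreement_rate(ads: list[str], human_labels: list[str]) -> float:
    """Cross-check: fraction of a labelled sample where model and human agree."""
    llm_labels = [classify_ad(ad) for ad in ads]
    return sum(l == h for l, h in zip(llm_labels, human_labels)) / len(ads)
```

The cross-checking step is the part worth emphasizing: spot-checking model labels against human labels on a sample is cheap, and it's what turns "we fed job ads into an LLM" into a defensible methodology.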
While there's a lot of emphasis and interest in our community right now on category 1, supporting research on AI or LLMs (like developing new deep learning methods for physics problems), I haven't seen as much emphasis on the more prosaic and quotidian use of already-existing (and often free or cheap) LLM services to get research done - the low-hanging fruit under category 3.
But our teams' mission is to increase scientific impact as much as possible given the resources available, and I get the sense that we should be thinking about this more.
Staff with digital humanities experience seem to be leaping into action for this kind of work more quickly than more traditional scientific computing teams, and in many cases they seem to be really kickstarting some of these efforts. Part of this may just be being able to speak the language of the social sciences and humanities fluently, which is where a lot of this work is happening (since analyzing textual data is more common there). Part might have something to do with existing experience with, and intuition for, NLP efforts and web services.
And of course, part might just be an artifact of small number statistics of the sorts of conversations I’m having these days.
I’d be very interested in hearing from you - are your teams supporting category-three LLM-powered efforts right now, and are digital humanities staff involved? Let me know!
With that, on to the roundup.
Managing Teams
Over at Manager, PhD, in the previous issue I covered articles on:
The importance of requests being clear, even small ones
Making hard decisions easier (and more clearly communicated) by explicitly choosing A even over other good things
Product Management and Working with Research Communities
Discontinuing a Research Software Project - Mark C. Miller, Better Scientific Software
The content of this article is perfectly useful and sensible, well worth reading, but the framing and word choice might as well have been grown in a lab to provoke me into a rant.
So yes, content first. Miller tells us we have to start with the end in mind, so that everything can be closed out as usefully as possible and all the good that can come out of the work does come out. He outlines a good set of steps for closing out a research software product:
Make an end-of-project release
Open-source the code, if not done already
Document final status
Gather lessons learned
Establish an enduring on-line presence with all this material
Present or publish - all that final documentation will help for this
Refactor critical dependencies if some pieces of the software product would be independently useful
Archive the software
These are all great. In particular, “lessons learned” are really useful to the community, and the idea of pulling out independent pieces for possible separate use (and other options for sustained development) is a good one I don’t see mentioned often.
Ok, now for the other part.
Look, as a community we need to get our stuff together about the very basic concepts that underlie the work we do and how we do it.
Someone's going to email me saying I'm harping on something that's "just semantics" - let's head that right off here. Anyone who tells you that semantics are trivial is someone you shouldn't listen to. We cannot communicate meaningfully, and we have a hard time even thinking clearly, without well-defined concepts and clear names for them.
Projects are how we organize doing our work. The work we do is to deliver products and services.
This matters.
All projects end. Projects are by definition finite bundles of work with a predetermined end. In a software context, a project might be a sprint or the addition of a new set of functionality, or a rewrite, or even a new set of documentation and examples. We’re going to do XYZ, and then we’ll be finished, and that’s the project. Fin.
The software itself is a product that we offer to the community. Or we offer services around the software, or around the creation of the software. (Or systems, or data analyses, or…) They involve ongoing work and operations, they require discovery and implementation and delivery and feedback loops. There’s no predefined end to a product, or our offering of a service.
Somehow in our community it’s seen as too tawdry, or common, or transactional, to say out loud that we provide products and services to research groups. The words echo discordantly in the hallowed and monastic halls of academia. But products and services are what we do, we’re good at it, and the products and services we provide enormously accelerate scientific progress.
If “product” is some dirty word that we can’t mention even amongst ourselves, we cut ourselves off from lots of perfectly good literature and books and blog posts about, for instance, product life cycles and how to wind down a product. If we refuse to think rigorously about services, thinking it’s too “commercial”, we’re going to struggle to explain what we do or to have the impact on science that our team could have.
Our funders have some stuff to answer for in all this. Too many of our funding mechanisms are still really set up for funding small collections of experiments, which are inherently projects. So we have to kind of frame the work we do - which is really about programmes and operations and products - as if it were a discrete set of projects.
But look, just because we have to twist words to explain our work to funders in ways they can understand and fund doesn't mean we have to keep using that warped diction when we're talking amongst ourselves. We offer products and services, and we do the work necessary to deliver those products and services by bundling tasks into projects.
Research Software Development
Directing new contributors to areas of need - Ben Cotton
It should go without saying, but for those of us who maintain open source code and are open to external contributions, it makes sense to make it as easy as possible for contributors to add the stuff you want added! Cotton suggests going beyond (say) Hacktoberfest recommendations by spelling out directions you want the code to go, non-code contributions you’d like (social media, specific documentation, examples), and keeping track of contributors and their skills so you can ask them for specific contributions.
The low-hanging fruit in computational reproducibility - Konrad Hinsen
Hinsen opens with an eye-catcher:
Here's a provocative proposition: we can solve computational reproducibility for a big majority of those 92% of researchers by buying them a license for Mathematica.
And I think he puts his finger squarely on why this seems so provocative:
All the activities around software in Open Science are organized by and for people who work in computational science, […] On the other hand, most of the 92% of researchers who depend on software do computer-aided research but not computational science. Their main tools are instruments or mathematical theories. They use computers as auxiliary tools, mostly for routine data analysis tasks. […] The people who contribute to Open Source projects for scientific software have overall the same profile as the participants of software-for-open-science events. [LJD: and so are not representative of most users’ needs]
I've written here before about how under-discussed commercial options or fee-for-service semi-public options are in our community, and this is a thought-provoking discussion starter on the topic, specifically in the context of reproducibility.
Some Python news - you may have already seen this, but there's now an experimental JIT in Python 3.13. The initial impact is modest, but this is an important start.
Research Data Management and Analysis
Federated Analysis for Privacy-Preserving Data Sharing: A Technical and Legal Primer - James Casaletto, Alexander Bernier, Robyn McDougall, and Melissa S. Cline, Annual Review of Genomics and Human Genetics
As more of our teams are being asked to work with sensitive data, federating data becomes an interesting way of gaining the analytics benefits of pooling data without actually pooling data. But it’s not a panacea, and there’s a lot to consider. This is a good review in the context of health data (and genomics in particular) but the basic concepts and concerns are explained well here and are quite broadly applicable.
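To make the core idea concrete, here's a toy sketch (mine, not from the primer) of federated aggregation: each site computes a local summary inside its own boundary, and only those aggregates - never row-level records - cross institutional lines:

```python
# A toy illustration (mine, not from the primer) of the core federated
# idea: each site computes a local summary statistic, and only those
# aggregates -- never row-level records -- leave the institution.
from dataclasses import dataclass

@dataclass
class LocalSummary:
    n: int          # number of records at the site
    total: float    # sum of the variable of interest

def summarize_site(values: list[float]) -> LocalSummary:
    """Runs *inside* each site's boundary; raw values never leave."""
    return LocalSummary(n=len(values), total=sum(values))

def federated_mean(summaries: list[LocalSummary]) -> float:
    """The central coordinator pools only the summaries."""
    n = sum(s.n for s in summaries)
    return sum(s.total for s in summaries) / n

# Example: three hospitals each share only (count, sum), not patient data.
sites = [[5.1, 6.2, 4.8], [5.9, 6.1], [4.5, 5.5, 6.0, 5.2]]
print(federated_mean([summarize_site(v) for v in sites]))
```

Real deployments layer secure aggregation, differential privacy, and governance machinery on top - small counts can still leak information - which is part of why the legal and governance discussion in the paper matters as much as the technical one.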
Ten simple rules for organizations to support research data sharing - Champieux et al, PLOS Comp Bio
If you build it - and do nothing else - they are absolutely not going to come. Building technical data sharing infrastructure is by far the easy part. The authors here talk about a range of other considerations to make sure that data sharing infrastructure is used, including:
Understand who will be using this, and why
Do communications work with those desired users well ahead of time
Make it easy by providing workflows and procedures
Get governance right from the beginning
Incorporate data ethics best practices
Lots of education and communication
Emerging Technologies and Practices
There have been a lot of quantum announcements since I last wrote you - the Finnish Quantum Flagship project, and partnerships between Pawsey and QuEra for quantum computing, MILA and Pasqal for quantum AI, and the US Geological Survey and Q-CTRL for quantum sensing and quantum computing. Also, STFC Hartree is now a maintainer for IBM's Qiskit, and LANL is collaborating with D-Wave. Beyond the slow but steady rise in interest in quantum computing despite everything else going on, these public-private partnerships seem to be increasingly common for quantum in a way I don't remember them being for more general computing, or even for AI or the Big Data wave that preceded it.
Random
Yes, you can define PDF pages larger than Germany.
Generating random numbers by hand, if you want to, which you shouldn’t.
Writing a TUI in bash, if you want to, which you shouldn’t.
SSH implemented over HTTP/3, if you want to, which maybe you shouldn’t?
I missed this when it came out - Google created a model for reading data off of graphs, and my current employer has a playground for using it for free (there was a Hugging Face one too, but it doesn't seem to be working properly as I write this). A lot better than those plot-digitizer draw-dots-by-hand approaches I'm used to.
For traditional numerical computing folks who are thinking of getting into LLMs in some way, I now usually recommend starting from the outside in: working on the tooling needed to use them effectively in (say) information retrieval, rather than going straight into neural architectures or something. There have been a couple of really good review papers on vector databases and vector retrieval generally that are worth reading; that area is probably pretty fruitful and accessible for many of us (especially if you've worked on particle methods!), and has applications broader than just LLMs.
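If you want a feel for what sits at the core of those systems, here's a minimal sketch: brute-force cosine-similarity retrieval over an embedding matrix. Production vector databases replace the linear scan with approximate indexes (HNSW, IVF, and friends), but the interface is essentially this; the embeddings below are random stand-ins for what an embedding model would produce:

```python
# A minimal sketch of the retrieval core underneath vector databases:
# brute-force cosine similarity over an embedding matrix. Real systems
# swap the scan for approximate nearest-neighbour indexes, but expose
# the same "give me the k closest vectors" interface.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 384))          # 10k documents, 384-dim
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit vectors

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most cosine-similar documents."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q                           # cosine sim for unit vectors
    return np.argsort(scores)[::-1][:k]

print(top_k(rng.normal(size=384)))
```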
Relatedly, here’s a free short course on LLMops, some of the content is also more applicable to MLops generally.
If you do want to tackle LLMs directly, this material is well thought of.
Replacing suid binaries with SSH over unix sockets.
Checkpoint/restore for using AWS spot instances for HPC.
SQL:2023 is finished, with a number of small changes that clean up various pieces of the standard.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below; the full listing of 17 jobs is, as ever, available on the job board.
Director of Illinois Computes - The National Center for Supercomputing Applications - University of Illinois, Urbana IL USA
NCSA is seeking a highly motivated leader to serve as the Director of Illinois Computes. The Director of Illinois Computes will provide coordination and leadership for the development and implementation of the Illinois Computes Program which includes existing long term investments from the University of Illinois System as well as the Urbana-Champaign campus. This will include integrating short-term tactical directions and long-range strategic directions of the program, establishing goals and timelines with program staff, and ensuring that staff and resources are directed efficiently toward achieving these goals within the specified timelines. In addition to directing the day-to-day operations of Illinois Computes, this position will provide oversight and management of assigned projects and/or programs operated by NCSA on behalf of the campus, national research communities, funding agencies, and other NCSA partner institutions.
Research Engineering Manager - Thomson Reuters, London UK
As a Lead Research Engineer at Thomson Reuters Labs, you will be part of a global interdisciplinary team of experts. We hire engineers and specialists across a variety of AI research areas to drive the company’s digital transformation. The science and engineering of AI are rapidly evolving. We are looking for an adaptable learner who can think in code and likes to learn and develop new skills as they are needed; someone comfortable with jumping into new problem spaces; who enjoys directing and supporting the efforts of others.
Director, Quantum Computing Applied Research - NVIDIA, Santa Clara CA USA
As Director for Quantum Computing Applied Research, you'll lead our technical path-finding efforts, working with cross functional teams in Product, Engineering and Applied Research to develop innovative technologies that advance the state of Quantum Computing and intercept NVIDIA products.
The Quantum Computing organization is a small, strong, and visible group both inside and outside of NVIDIA while Quantum Information Science is an exciting area to drive strategy. We need a self-starting leader to continue to grow this area.
High Performance Computing and Research Data Services Manager - Vassar College, Poughkeepsie NY USA
The High Performance Computing and Research Data Services Manager at Vassar College is a critical role in supporting research excellence and data management within the academic community. This position is responsible for managing and optimizing the high-performance computing (HPC) infrastructure and ensuring the efficient and secure management of research data and storage. The role involves collaborating with researchers, faculty, and students to enable cutting-edge projects while maintaining the integrity and accessibility of research data.
The High Performance Computing and Research Data Services Manager reports to the Enterprise Systems team within Computing & Information Services and will work as a collaborative member of the Systems team to ensure operational continuity and support for HPC, Data Science, and other research-related computing and storage solutions. Computing & Information Services engages in cooperative efforts with the ACCAS Steering Committee, the Data Science & Society (DSS) Steering Committee, and various other academic departments to identify and implement priorities related to research computing.
Manager, High-Performance Computing - Columbia University, New York NY US
Reporting to the Sr. Director of Research Services; the Manager of High-Performance Computing (HPC) leads the HPC team and interacts extensively with leaders of other research support groups including the Office of the Executive VP for Research and other CUIT groups. Key initiatives to be accomplished by the incumbent include the advancement of a shared High-Performance Computing platform, the advancement of research computing in the cloud, assessment, and development of research tools for computational performance. Additionally, the Manager leads collaboration efforts with other central research support entities at Columbia University.
College of Engineering Research Computing Lead - University of Wisconsin at Madison, Madison WI USA
We need a leader to help refine and grow our research computing services in the UW-Madison College of Engineering (CoE). This position will guide strategy to improve research computing efforts across the college, collaborating with technical colleagues, campus-level cyberinfrastructure efforts, and compliance professionals.
You will engage with researchers to understand their technology needs, translate that into service models, and drive implementation of those services. You will network with similar professionals across the UW-Madison campus to facilitate efficient use of computing services to advance research. In addition, you will work with wonderful colleagues to improve our readiness for audits around security, research integrity, and data protection.
Research Computing Support Lead (hybrid) - Northwestern University, Evanston, IL
As a Research Computing Support Lead, you will lead, develop, and supervise a team of staff and student computational consultants in delivering user support to researchers across Northwestern through tickets, consultations, workshops, and documentation. You will work with researchers and colleagues within and beyond Northwestern to understand emerging trends. You will oversee the development and implementation of necessary improvements and operational processes for support services to meet the growing and evolving research needs across the University.
Strong candidates for this role will have experience in supervising or coaching staff/students/interns, be experienced in leading or managing a service, and have a demonstrated record of building relationships.