The importance of free software to science
Free software plays a critical role in science, both in research and in disseminating it. Aspects of software freedom are directly relevant to simulation, analysis, document preparation and preservation, security, reproducibility, and usability. Free software brings practical and specific advantages, beyond just its ideological roots, to science, while proprietary software comes with equally specific risks. As a practicing scientist, I would like to help others—scientists or not—see the benefits of free software in science.
Although there is an implicit philosophical stance here—that reproducibility and openness in science are desirable, for instance—it is simply a fact that a working scientist will use the best tools for the job, even if those might not strictly conform to the laudable goals of the free-software movement. It turns out that free software, by virtue of its freedom, is often the best tool for the job.
Reproducing results
Scientific progress depends, at its core, on reproducibility. Traditionally, this referred to the results of experiments: it should be possible to attempt their replication by following the procedures described in papers. In the case of a failure to replicate the results, there should be enough information in the paper to make that finding meaningful.
The use of computers in science adds some extra dimensions to this concept. If the conclusions depend on some complex data massaging using a computer program, another researcher should be able to run the same program on the original or new data. Simulations should be reproducible by running the identical simulation code. In both cases this implies access to, and the right to distribute, the relevant source code. A mere description of the algorithms used, or a mention of the name of a commercial software product, is not good enough to satisfy the demands of a meaningful attempt at replication.
The source code alone is sometimes not enough. Since the details of the results of a calculation can depend on the compiler, the entire chain from source to machine code needs to be free to ensure reproducibility. This condition is automatically met for languages like Julia, Python, and R, whose interpreters and compilers are free software. For C, C++, and Fortran, the other currently popular languages for simulation and analysis, this is only sometimes the case. To get the best performance from Fortran simulations, for example, scientists often use commercial compilers provided by chip manufacturers.
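To see the problem in miniature: floating-point arithmetic is not associative, so a compiler or optimizer that merely reorders a summation can legitimately change the low-order bits of a result. A minimal sketch in Python (the effect is the same in any language using IEEE-754 doubles):

    # Floating-point addition is not associative, so the evaluation
    # order a compiler chooses can change the low-order bits of a result.
    a, b, c = 0.1, 0.2, 0.3
    left = (a + b) + c    # one grouping
    right = a + (b + c)   # a reordering an optimizer might produce
    print(left == right)  # False: 0.6000000000000001 vs 0.6

Differences like this are harmless in isolation, but they compound over millions of operations, which is why the whole toolchain matters for bit-level reproducibility.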
Document preparation and preservation
The forward march of science is recorded in papers, which are collected on preprint servers (such as arXiv) and on the home pages of scientists, and published in journals. It's obviously bad for science if future generations can't read these papers, or if a researcher can no longer open a manuscript after upgrading their word-processing software. Fortunately, the future readability of published papers is enabled by the adoption, by journals and preprint servers, of PDF as the universal standard format for the distribution of published work. This has been the case even with journals that request Microsoft Word files for manuscript submission.
PDF files are based on an open, versioned standard and will be readable into the foreseeable future with all of the formatting details preserved. This is essential in science, where communication is not merely through words but depends on figures, captions, typography, tables, and equations. Outside the world of scientific papers, HTML is by far the dominant markup language used for online communication. It has advantages over PDF in that simple documents take less bandwidth, HTML is more easily machine-readable and human-editable, and by default text flows to fit the reader's viewport. But this last advantage is an example of why HTML is not ideal for scientific communication: its flexibility means that documents can appear differently on different devices.
The final rendering of a web document is the result of interpretation of HTML and CSS by the browser. The display of mathematics typically depends on evolving JavaScript libraries as well, so the author does not know whether the reader is seeing what was intended. The "P" in PDF stands for "portable": every reader sees the same thing, on every device, using the same fonts, which should be embedded in the file. The archival demands of the scientific record, combined with the typographic complexity often inherent to research papers, require a permanent and portable electronic format that sets a paper's appearance in stone.
To aid collaboration and to ensure that their work is widely readable now and in the future, scientists should distribute their articles in the form of PDF files, ideally alongside text-based source files. In mathematics and computer science, and to some extent in physics, LaTeX is the norm, so researchers in these fields will have the editable versions of their papers available as a matter of course. Biology and medicine have not embraced the culture of LaTeX; their journals encourage Word files (though many also accept RTF). Biologists working in Word should create copies of their drafts in an XML-based format, such as Word's own .docx or the OpenDocument .odt format, which Word can also save; even if these files cannot be opened by a future version of Word, their contents are zipped XML and will remain readable. Preservation of text-based, editable source files is essential for scientists, who often revise and repurpose their work, sometimes years after its initial creation.
Licensing problems
Commercial software practically always comes with some form of restrictive license. In contrast with free-software licenses, commercial ones typically interfere with the use of programs, which often throws a wrench into the daily work of scientists. The consequences can be severe; software that comes with a per-seat or similar type of license should be avoided unless there is no alternative.
One sad but common situation is that of a graduate student who becomes accustomed to a piece of expensive commercial analytical software (such as a symbolic-mathematics program), enjoying it either through a generous student discount or because it's paid for by the department. Then the freshly-minted PhD discovers the real price of the software, and can't afford it on their postdoc salary. They have to learn new ways of doing things, and have probably lost access to their past work, which is locked up in proprietary binary files.
A few months ago, an Elsevier engineering journal retracted two papers because their authors had used a commercial fluid-dynamics program without purchasing a license for it. The company behind the program regularly scans publications looking for mentions of its product in order to extract license fees from authors. In these cases, the papers had already been cited, so their retraction is disruptive to scholarship. Cases such as these are particularly clear examples of the potential damage to science (and to the careers of scientists) that can be caused by using commercial software.
In addition, certain commercial software products with per-seat licensing "call home" so that the companies that sell them can keep track of how many copies of their programs are in use. The security implications of this should be obvious to anyone, yet government organizations, while adhering minutely to security rituals with questionable efficacy, permit their installation. While working at a US Department of Defense (DoD) lab, I was an occasional witness to the semi-comical sight of someone running around knocking on office doors, trying to find out who was using (or had left running) a copy of the program that they desperately needed to use to meet some deadline—but were locked out of.
Software rot
Ideally scientists would only use free software, and would certainly avoid "black box" commercial software for the various reasons mentioned in this article. But there is another category that's less often spoken of: commercial software that provides access to its source code.
When I joined a new project at my DoD job, the engineer that I was supposed to work with was at a loss because a key software product had stopped working after he upgraded the operating system (OS) on his workstation. The operating system couldn't be downgraded, and the company was no longer supporting the product. I got a thick binder from him with the manual and noticed a few floppy disks included. These contained the source code. Right at the top of the main program was a line that checked the version of the OS and exited if it was not within the range that the program had been tested on. I figured we had nothing to lose, so I edited this line to accept the current OS version. The program ran fine and we were back in business.
The point of this anecdote is to illustrate the practical value of access to source code. Such proprietary but source-available software occupies an intermediate position between free software and the black boxes that should be strictly avoided. Source-available software, although more transparent, practical, and useful than black boxes, still fails to satisfy the reproducibility criterion, because the scientist who uses it can't publish or distribute the source; therefore other scientists can't repeat the calculations.
Software recommendations
The following specific recommendations are for free software that's potentially of use to any scientist or engineer.
Scientists should, when practical, test their code using free compilers, and use these in preference to proprietary options when performance is acceptable. For the C family, GCC is the venerable standard, and produces performant code. A more recent but now equally capable option is Clang.
For Fortran, GFortran (which is a front-end for GCC) is a high-quality compiler and the standard free-software choice. Several more recently developed alternatives are built, as is Clang, on LLVM. Adding to the potential confusion, two of these are called "Flang". Those interested in investigating an LLVM option should follow the project usually called "LLVM Flang", which is written from scratch in C++ and was renamed to "Flang" once it became part of the LLVM project in 2020. Its GitHub page warns that it is "not ready yet for production usage", but this is probably the LLVM Fortran compiler of the future. Another option to keep an eye on is the LFortran compiler. Although still in alpha, this project (also built on LLVM) is unique in providing a read-eval-print loop (REPL) for Fortran.
For those scientists not tied to an existing project in a legacy language, Julia is likely the best choice for simulation and analysis. It's an interactive, LLVM-based, high-level, expressive language that provides the speed of Fortran. Its interfaces to R, gnuplot, and Python mean that those who've put time into crafting data-analysis routines in those languages can continue to use their work.
Although LaTeX is beloved for the quality of its typesetting, especially for mathematics, it is less universally admired for the inscrutability of its error messages, the difficulty of customizing its behavior using its arcane macro language, and its ability to occasionally make simple things diabolically difficult. Recently a competitor to LaTeX has arisen that approaches that venerable program in the quality of its typography (it uses some of the same critical algorithms) while being far easier to hack on: Typst. Like LaTeX, Typst is free software that uses text files for its source format, though Typst does also have a non-free-software web application. Typst is still in alpha, and so far only one journal accepts manuscripts using its markup language, but its early adopters are enthusiastic.
A superb solution for the preparation of documents of all types is Pandoc, a Haskell program that converts among a huge variety of file formats and markup languages. Pandoc allows the author to write everything in its version of Markdown and convert it into LaTeX, PDF, HTML, various Word formats, and more. Raw LaTeX, HTML, and other formats can be embedded in the Markdown source, so the fact that Markdown has no markup for mathematics (for example) is not an obstacle. The ability to maintain one source and automatically create a PDF and a web page, or to produce a Word file for a publication that insists on it without having to touch a "what you see is what you get" (WYSIWYG) abomination, greatly simplifies the life of the writer/scientist. Pandoc can even output Typst files, so those who use it are ready for that revolution if it comes.
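As an illustrative sketch of that one-source, many-outputs workflow (assuming pandoc is installed, along with a LaTeX engine for the PDF step, and that paper.md is a hypothetical Markdown manuscript), a few lines of Python can drive the conversions; the same commands work directly at the shell:

    # Fan one Markdown source out to several publication formats;
    # pandoc infers the output format from the file extension, and
    # -s requests standalone (complete) documents.
    import subprocess

    for target in ["paper.pdf", "paper.html", "paper.docx"]:
        subprocess.run(["pandoc", "-s", "paper.md", "-o", target], check=True)

    # Typst output, available in recent pandoc releases:
    subprocess.run(["pandoc", "-s", "paper.md", "-t", "typst", "-o", "paper.typ"],
                   check=True)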
Conclusion
The goals of the free-software movement include ensuring the ability of all users of software to form a community enriched and liberated by the right to study, modify, and redistribute code. The specific needs of the scientific community bring the benefits of free software into clear focus; those benefits are critical to the health and continued progress of science.
The free-software movement has an echo in the "open-access movement", which is centered on scientific publication and began in the early 1990s. It has its origins in the desire of scientists to break free of the stranglehold of the commercial scientific publishers, who have traditionally interfered with the free exchange of ideas, extracted reviewer labor without compensation, and charged exorbitant fees for access to scientific knowledge. Working scientists are aware of the movement, and most support its aims of providing free access to papers while preserving the curation and quality control inherited from traditional publishing. It is important to also continue to nourish awareness of the crucial role that free software plays throughout the scientific world.
Index entries for this article
GuestArticles: Phillips, Lee
Engineers need to do better
Posted Jun 4, 2025 14:26 UTC (Wed) by willy (subscriber, #9762) [Link] (3 responses)
I found some code which had been written against Python 2.6 and would have needed substantial changes to make it work with 2.7. Part of that was using a library which was no longer available. And I'm no Python expert, so I just gave up.
You're right that open source software gives us an advantage, but we have to have a better legacy story than this! Whether that's preserving digital artifacts better or having a better backward compatibility story or something else ...
Engineers need to do better
Posted Jun 4, 2025 16:07 UTC (Wed) by fraetor (subscriber, #161147) [Link] (1 responses)
Software is an area where university-based researchers often struggle more than their industry counterparts, largely due to the short-term nature of university funding and contracts, and a focus on publication output for promotion, etc. Over the past few years a number of UK universities have established a central pool of research software engineers (RSEs) [1], often employed on a permanent basis, to mitigate this.
However, while there is a lot of focus around reproducibility, especially in the context of FAIR [2], it does seem that a lot of the effort is going towards freezing all the dependencies and effectively reproducing the original environment, whether through conda environments, containers, or VMs. I guess it is the difference between single paper analytics, and creating a reusable analytics tool.
Software is more essential to science than ever before, so this is definitely an area to keep on improving.
[1]: J. Cohen, D. S. Katz, M. Barker, N. Chue Hong, R. Haines and C. Jay, "The Four Pillars of Research Software Engineering," in IEEE Software, vol. 38, no. 1, pp. 97-105, Jan.-Feb. 2021, https://doi.org/10.1109/MS.2020.2973362
[2]: Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
Engineers need to do better
Posted Jun 4, 2025 22:01 UTC (Wed) by kleptog (subscriber, #1183) [Link]
The projects at $DAYJOB I find the most fun are the ones where I'm given a pile of code, written by some researcher or analyst to solve a problem, and asked to turn it into something usable. The joy on their faces when you spend an afternoon restructuring their code to produce something that is more readable, more reliable, and completes in a fraction of the time is priceless.
Engineers need to do better
Posted Jun 4, 2025 17:21 UTC (Wed) by fenncruz (subscriber, #81417) [Link]
"This means that most software written by scientists is of poor quality and does not use best practices which will let it run on future substrates."

As someone who has made the move between academia and industry, I can assure you that people can and do write bad code in any environment.
LyX
Posted Jun 4, 2025 16:24 UTC (Wed) by joib (subscriber, #8541) [Link]
Reproducibility
Posted Jun 4, 2025 22:08 UTC (Wed) by randomguy3 (subscriber, #71063) [Link] (3 responses)
I think the bigger win for using free software (along with open data) is making it harder to hide flawed analysis. When different experiments appear to disagree, it can provide a way of investigating why, and whether something underhanded (or incompetent) has been going on.
Reproducibility
Posted Jun 5, 2025 0:20 UTC (Thu) by pizza (subscriber, #46) [Link] (1 responses)
Without the former, there is no point in even attempting the latter.
Let's say your new implementation produces different results with the same data. Maybe the problem is with your implementation, maybe the problem is with the algorithm, or maybe the problem is actually in the original, calling the original conclusions into question.
You often (usually?) don't know which differences from the original may turn out to be material, so ideally you'd try to precisely recreate the original results (i.e., with the same input and exactly-as-described procedure) before changing anything.
Reproducibility
Posted Jun 5, 2025 10:40 UTC (Thu) by farnz (subscriber, #17727) [Link]
The other critical point IME is that it's not unknown for algorithms to have external dependencies that aren't documented because "everyone" uses the same setup.
For example, I've had to work with code that started failing when recompiled for an SGI machine, rather than the Windows NT box it had been written for, and the Debian x86 boxes it had been ported to successfully. We instrumented the code base, and determined that the SGI machine was underflowing intermediates in floating point calculations, where the same code compiled for Windows or Debian was not - because intermediates were being kept in extended precision x87 stack slots, where the SGI was using double precision FPU registers.
Without the ability to compare, I'd just have a chunk of C code that the original author claimed worked "just fine" (and was backed by other users, who also found it working "just fine" on Linux and Windows NT), but which failed on my target system. With the ability to compare, I could find the problem, determine that the hidden assumption is that all FPUs are x87-compatible, and then work with the originator on a fix.
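[Editor's note: the x87-specific behavior can't be reproduced from a high-level language, but the general hazard described here (an intermediate value that survives at one precision and underflows at another) is easy to demonstrate. A minimal sketch in Python with NumPy, where the value 1e-30 is just an illustrative choice:]

    import numpy as np

    # The intermediate product 1e-60 is representable as a double but
    # underflows to zero as a single-precision float, changing the result.
    x = 1e-30
    print((np.float64(x) * np.float64(x)) / np.float64(x))  # 1e-30
    print((np.float32(x) * np.float32(x)) / np.float32(x))  # 0.0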
Reproducibility
Posted Jun 5, 2025 9:42 UTC (Thu) by magi (subscriber, #4051) [Link]
I think that in order to check reproducibility you need automatic tests. Chances are that you will end up with results that are not strictly the same (as in binary equality). The question then is whether the results are significantly different, whatever that might mean in this context. I think this is really one of the big differences between scientific software and "normal" software. In addition, what is the correct result anyway, given that simulations are used to explore systems that cannot be tackled otherwise? Another issue with testing scientific software is that the tests might require huge resources (think weather models) or closed data (think medical data).
Having good tests will allow people to convince themselves the software is doing what it is supposed to. It will also help port the software to new systems and deal with the software rot.
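[Editor's note: a minimal sketch of such a tolerance-based test in Python with NumPy; the reference values and tolerances here are placeholders, since choosing them is exactly the domain judgment described above:]

    import numpy as np

    def check_reproduced(reference, rerun, rtol=1e-6, atol=1e-12):
        # Accept small floating-point differences instead of binary equality.
        return np.allclose(rerun, reference, rtol=rtol, atol=atol)

    # Hypothetical reference results from a paper vs. a rerun elsewhere:
    reference = np.array([1.0000000, 2.5000000, 3.1415926])
    rerun = np.array([1.0000001, 2.4999999, 3.1415927])
    assert check_reproduced(reference, rerun)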
paper production
Posted Jun 5, 2025 9:51 UTC (Thu) by magi (subscriber, #4051) [Link]
Using Markdown for writing papers is great. This works particularly well when combined with GitLab/GitHub for collaboration and CI/CD pipelines to automate spell checking and PDF production. Obviously LaTeX works equally well, but the syntax is slightly heavier.
I missed any mention in the article of reproducible documents. I think the idea comes from the R world, where the document contains the code that is run to produce the document's outputs. Quarto supports many languages, including Python and R, uses Pandoc to produce the output, and can be edited using Jupyter notebooks.
Scheduling influences on simulation outcomes?
Posted Jun 5, 2025 10:37 UTC (Thu) by taladar (subscriber, #68407) [Link] (1 responses)
I am thinking of things like not using a shared PRNG from multiple threads, where the scheduling order determines which parts of its (otherwise deterministic) output each thread receives.
But beyond that some auto-configuration of the program might also be a problem, e.g. detecting the RAM size or CPU core count and scaling operations by that by spawning more threads or processing larger batches at a time.
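[Editor's note: one standard fix for the shared-PRNG problem, sketched in Python with NumPy; the task function and counts are hypothetical. Each worker derives its own reproducible stream from a single published seed, so results no longer depend on which thread runs first:]

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def simulate(seed_seq):
        # Each task gets its own generator, so its stream is fixed
        # regardless of how the threads are scheduled.
        rng = np.random.default_rng(seed_seq)
        return rng.standard_normal(1000).sum()

    root = np.random.SeedSequence(42)   # the one seed you publish
    streams = root.spawn(8)             # one child sequence per task
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(simulate, streams))
    # 'results' is identical from run to run, in the same order.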
Forwards-compatibility is hard
Posted Jun 5, 2025 11:20 UTC (Thu) by farnz (subscriber, #17727) [Link]
You also get into having to think about what might be different in the future; for example, I've seen problems with an algorithm in the late 1990s that assumed x87 FPUs and was reproducible on all x86 CPUs of the era, but not on future x86 family CPUs (using SSE2 instead of x87), or on MIPS CPUs.
And then there's things like using 32 bit counters because you can't overflow them in reasonable time; this is true when things are slow enough, but as they get faster, it can become false. For example, an Ethernet packet is a minimum of 672 bit times on the wire; in 1995, a 32 bit packet counter represented over 8 hours of packets at the maximum standardised rate. However, today's maximum standardised rate (from 2024) is 800 Gbit/s, or overflow in a bit over 3.6 seconds.
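[Editor's note: the arithmetic behind those numbers, for anyone who wants to check it; a minimum-size Ethernet packet occupies 672 bit times including preamble and inter-frame gap:]

    # Seconds until a 32-bit counter of minimum-size packets overflows.
    def overflow_seconds(bits_per_second, packet_bits=672):
        return 2**32 / (bits_per_second / packet_bits)

    print(overflow_seconds(100e6) / 3600)  # 100 Mb/s (1995): about 8.0 hours
    print(overflow_seconds(800e9))         # 800 Gb/s (2024): about 3.6 seconds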
The best we can reasonably ask for is that it's possible to follow your documentation and reproduce your results - that can include documenting the hardware, OS, and other details of the system you produced the result on. That way, it becomes possible for a future reimplementation of your algorithm to do the A/B comparison between your system and their new system, even if they've had to get help from a museum to build up the required hardware to reproduce your results.
.odt
Posted Jun 5, 2025 10:52 UTC (Thu) by grawity (subscriber, #80596) [Link]
In particular because .odt isn't even a Word format – it's the OpenOffice/LibreOffice format that Word only has secondary support for.
Though compatibility of LibreOffice with Word-produced .docx vs Word-produced .odt might well be the same these days?