Introductory talk welcoming you to RSE2017, providing overview information about the conference, plus an overview of the programme for the two days.
Short talk giving a summary of the results of the 2017 RSE Survey.
Abstract to appear soon!
We are on the verge of the biggest revolution in computing since Alan Turing formalised the concepts of computation. Today software is hand-made, relying on highly-skilled craftspeople to intricately tell computers how to do useful work. A Cambrian explosion in computing is imminent, as artificial intelligence and machine learning are enabling computers to learn from data. But what is the reality beyond the hype? In this talk we will show the huge range of what is possible now, without having to have a specialist PhD in AI or machine learning. We will demonstrate how you can accelerate your current research projects with AI, and how you can set yourself up to be tomorrow’s RSE AI superstar, across research domains such as chemistry, engineering, environmental and earth sciences, genomics, humanities, physics, and social sciences.
I will discuss some aspects of funded research projects that involve large internationally developed open source software packages. These will cover all stages:
– developing a grant application (e.g. how to ensure that the project will be well received by the community; how to describe an OSS project in the IP rights section of the grant application; how to respond to claims that being OSS is a risk factor)
– implementing the project (e.g. soliciting contributions from wider community; managing dependencies between funded and non-funded activities; running projects that involve multiple OSS projects)
– follow-up activities (ensuring that new developments are merged into the mainstream version of the software; providing necessary support for them, etc.).
The talk will reflect my experience from several projects of which I am a part, including the GAP system (http://www.gap-system.org) and the EU Horizon 2020 project OpenDreamKit (http://opendreamkit.org).
At RSE 2016, Scott presented some of CANARIE’s efforts over the last decade to make research software development more efficient through co-development and reuse. Following on this theme, he will describe the team’s experiences in implementing two new initiatives to further support re-use of research software.
In early 2017, CANARIE launched a funding call in which groups who had existing research software were funded to adapt that software for use by others, including those in different disciplines. Scott will discuss the parameters and the outcomes of this call, including an approach to fund long-term maintenance of research software to support multiple teams.
In parallel, CANARIE’s software team initiated a study of existing research software frameworks specifically designed for re-use. The goal is to understand how a permanent software development team’s expertise in infrastructure could assist research software developers by allowing them to focus on science-facing software. Scott will also present the progress made on this activity and plans for the future.
Doing reproducible research is hard, but worthwhile. There are no black boxes; we can trace every result in a paper back to its source data. The process from source data to result can be long and complex; the price of this transparency is that we can no longer hold our toolchains together with sellotape and string.
Good software engineering practice improves the robustness of the toolchain. A logical extension of this is to treat the manuscript itself as an integrated part of the software project. Tools such as Knitr allow us to include the code that produces the results of our analysis in the LaTeX source of the manuscript.
Through a case study, we explain how we have done this, using a combination of Makefiles, Docker images and Knitr. We will discuss some of the challenges in this approach, such as managing complex software dependencies and scripting our analysis steps.
Significantly, this approach makes the contribution of RSEs to the research process obvious. As the publication output is dependent on the software, RSEs are automatically credited.
The UK Met Office’s LFRic project is developing a successor to its current Unified Model (UM). A key part of this work is ensuring performance portability over the expected decades-long lifetime of the model (the existing UM is over 20 years old). LFRic is aiming to achieve this through the use of a Separation of Concerns which enables the code dealing with the natural science (meteorology) to be kept separate from that related to computational science (performance). As part of this, a Domain Specific Language has been developed which then permits automatic generation of the parallel aspects of the application. This code translation and generation is performed by the PSyclone tool, developed by STFC’s Hartree Centre.
We describe the architecture and development of PSyclone as well as the interface and working practices used by the LFRic developers. We then move on to describe how an HPC expert can use PSyclone to implement architecture-specific optimisations while leaving the science code base unchanged – it is this functionality that permitted the LFRic code to move from purely sequential execution to running on more than 50,000 cores in the space of a week.
Intel’s Knights Landing (KNL) generation Xeon Phi chip presents a particular challenge for parallel software development, with a large number of cores, multiple threads per core and wide vector processing units. Throw into the mix a confusing list of configuration options, from tuning sub-NUMA regions to selecting cache or flat mode for its on-board high-bandwidth memory, and fully exploiting this hardware becomes even more challenging. Additionally, what works best for a single node may not be the best option when scaling a code over a large system. For example, what is the correct balance between MPI and OpenMP within a node to optimise internode communications whilst maintaining single-node performance at an acceptable level? In this talk the Intel Parallel Computing Centre (IPCC) based at The Hartree Centre will share their experiences of scaling DL_POLY_4, a research community code, across large numbers of KNL nodes on their new machine, Scafell Pike. The talk will cover tuning for scalability and advanced MPI and OpenMP topics such as asynchronous communication and task-based parallelism.
Computer processors consist of billions of transistors working to deliver performance for your applications, but physical laws and constraints dictate how effectively we can use these transistors. The performance of GPUs continues to rise noticeably faster than traditional processors; in this talk we will look at why this should be and why the future is healthy for high throughput computing. We will look at how the single architecture developed by NVIDIA is valuable across a wide range of applications, including artificial intelligence, high performance computing, and image and video processing.
Tensor network theory has become one of the most powerful tools in theoretical and computational physics for the analysis and design of future quantum technologies, such as the quantum computer. However, tensor network theory is much more general and can also be used outside physics. In particular, it has led to new algorithms for solving partial differential equations that are much more efficient than currently existing approaches. In my talk, I will introduce our publicly available Tensor Network Theory Library:
I will present its user-friendly interface and illustrate its performance using a quantum problem as well as a partial differential equation.
The ideas of provenance, scientific workflow and reproducibility are key to the development of good scientific code. In this talk, we explore the fundamentals of Literate Programming, an idea that has been around for decades but has had surprisingly little influence in most disciplines. Of course, the idea of Notebooks is heavily based on this concept. We discuss how telling a story with your code – not only showing your results, but how you got there – can be an invaluable tool in teaching and research, and most of all in sharing your results with other researchers and colleagues. Brief examples will be shown using Jupyter Notebooks.
Mathematicians describe shapes using equations and, by manipulating those equations, can often break those shapes up into simpler pieces. We describe a project that combines ideas from geometry, string theory, and scientific computation, with the aim of finding “atomic pieces” of mathematical shapes — one can think of this as building a Periodic Table for geometry. An essential part of this involves developing new tools for exact computational algebra on a massive scale (thousands of cores; centuries of runtime; dozens of collaborators). We will discuss the challenges of developing this system: both technical (lack of existing infrastructure) and cultural (poor fit with the HPC community; the place of computational experiment in mathematics research; the challenge of publishing theorems in pure mathematics that rely on massive computations). We end by discussing a key future problem: how to make such tools easy to use for scientists who are not specialists in computation.
Okay, you got me there: storytelling is not just for RSEs… However, who can gain more from this soft skill than people at the interface between communities, for example the ones sitting between the R and the SE? After all, your ability to reach an audience with varied backgrounds depends strongly on the story you tell them. This talk is a reminder of the stories you can tell your colleagues and students, or more precisely a tale of how to tell a story they would also want to be a part of. The story of Science, of understanding, of sharing work, and of the good practices that allow us to do so.
How should we teach programming? What should we teach? What is programming anyway? This summer, I will teach an introductory programming course at the Alan Turing Institute. The course will be aimed at non-programmers and I intend to take a particular stance on the answers to these questions.
The course will be unusual in two ways: First, the attendees come from HR, Finance, Events, and other corporate functions; most will never have programmed before and indeed will have no formal mathematical training beyond secondary school. They are, however, very enthusiastic about understanding what programming is because it seems to be a large part of whatever “data science” is.
And second, we will be using Racket, a dialect of Scheme. Racket is a lovely language that I wish were more widely used. My belief is that it is also an ideal teaching language.
The course could go one of two ways. In this talk, I will tell you how it went.
In research facilities, scientists often develop software. Most of them do not have any specific education in software development. Usually they took programming courses at university or taught themselves some programming skills. Therefore their knowledge of software engineering and adjacent topics is quite limited.
To support scientists, we created a set of software engineering guidelines. These guidelines give advice in different fields of software development (e.g., requirements management, design and implementation, change management). To make it easy to start with them, we developed a simple classification scheme that takes into account aspects such as expected software size or software lifetime. This scheme is useful to filter the guidelines and to fit them to the right context. Besides providing written guidelines and explanations, we created checklists in different formats (e.g., Markdown, Word) to offer scientists a light-weight and easy-to-use tool.
In this talk, we provide an overview of the concept of the guidelines and report on our experiences introducing them at the German Aerospace Center (DLR), a large research facility in Germany. At DLR around 2000 to 3000 people develop software part-time or full-time.
The European Molecular Biology Laboratory (EMBL) is a diverse and modern research institute, hosting ~600 life scientists. Reflecting a general trend in biological research, the fraction of EMBL scientists devoting ≥50% of their time to computational activity grew from 39% to 46% between 2009 and 2015. These computational scientists are distributed amongst >50 research groups, with great variety in the approaches they use. This large and varied environment presents challenges for effective and efficient computational science.
Bio-IT is an initiative established to support the development and technical capability of this diverse computational biology community. The community has grown organically to tackle challenges together, providing training and support for scientific computing, creating and maintaining tools and resources for reproducible science, and providing opportunities for discussion and collaboration. Highlights include an internal system for version control and management of software development projects, a coding club, and training courses for ~400 people in the last two years. Here, we share some lessons learned while establishing this community, and discuss the efforts required to maintain it thereafter.
FFEA is a new piece of OpenMP software that uses continuum mechanics to simulate mesoscopic systems subject to thermal fluctuations. The physics-based methodology has been applied to biological systems, such as molecular motors or protein aggregation, as well as to non-biological systems such as colloids.
Because it is research software, FFEA has a particular life cycle in which new researchers with different programming abilities constantly need to understand and alter the code. It therefore needs clear documentation of both usage and code, it needs to perform well while remaining modular, and it needs automatic testing to check that the different modules work together consistently, together with detailed version control so that results can always be related to a specific version of the code.
In this talk I will present the software, and discuss the approaches I took to ensure that all of these requirements are fulfilled, as well as their impact on the sustainability of research software.
An ongoing project with the Oxford University Museums is working to improve access to visual art works via audio and haptic interfaces for people who are visually impaired. As part of the research an Android application was developed to enable the modelling of how people touch the paintings and photographs. As a sighted person, it is extremely difficult to comprehend how touch is used to explore raised images – ‘touch tiles’ – of visual art works. Working with the existing Touch Tours provided by the Museums, and with focus groups over 6 months, we collected data on how touch is used when exploring the tiles, including attentiveness to features, exploration patterns, and preferred touch tile material. We soon realised that we needed more detailed data on exploration patterns, and so developed an application that could track and record both pressure and movement. The tile was placed on top of a tablet screen. This application enables us to model the movements and the pressure with which the touch tiles are explored. It also enables us to investigate exploration time per feature, e.g. for how long a certain shape is explored. These experiments support the further development and testing of the interface.
TexGen is open source software developed at the University of Nottingham for 3D geometric modelling of textiles and textile composites. It was released as open source in 2006, hosted on SourceForge. Since then there have been more than 28,000 downloads and the software is used worldwide, as evidenced by the many publications citing its use.
It has been proposed that the TexGen project should be used as the subject of a REF impact case study. For this, evidence must be provided of more than downloads and citations; it must be shown that actual benefit has been derived from its use, for example improved business output. Simply tracking download and page view metrics is therefore not sufficient, and wider-reaching efforts must be made to gather information. This talk will look at the still ongoing measures being taken to gather evidence of how TexGen is being used and by whom.
MERLIN is a C++ accelerator physics library, originally developed in the early 2000s for use in linear particle collider simulations. Following a gap in both use and development, MERLIN was adopted in 2009 by active members of the CERN High-Luminosity Large Hadron Collider project and extended for collimation-specific studies. Recent developments, circa 2010-2016, focused on obtaining physics results rather than on code design and sustainability. This has inevitably resulted in the code having an unnecessarily steep learning curve for both new users and new developers. This talk presents the current active developers’ recent endeavours to restructure, refactor and optimise the code so that it aligns with advocated software engineering practices. This process has focused on use case accessibility, long-term sustainability, parallelisation and scalability. More specifically, the talk presents test metrics and time-investment returns, providing new information on the practical implementation of agile development practices for scientific software.
Imaging has been a key data source in biology for hundreds of years. Modern bioimaging devices provide a wealth of electronic data, from detailed three and four dimensional microscopy images of tiny structures, to large scale images captured from drones. Extracting useful information from this deluge of data provides many interesting challenges, and requires computational approaches.
Computational bioimaging is a developing field. Existing tools tend to focus on exploratory data analysis and semi-manual measurement. As a result, it can be hard to ensure that analyses are repeatable, or to scale those analyses to run on traditional high performance computing (HPC) hardware.
We’ll talk about how research software engineering can improve biological research by adding reproducibility and scalability to image analyses, giving examples of tools, libraries and projects that we’ve developed. We’ll also talk about the challenges and opportunities of being computer people in an experimental science world.
There is no doubt that container virtualisation is a useful tool for reproducible research.
An important result of its adoption is that complex, well documented environments will be accessible for others to reuse.
The demand for these portable environments will grow, especially in the long tail of science, to ease the burden of translating experiment to execution and publication.
However, the execution system itself is often overlooked when defining them.
Where do we draw the line that separates system from experiment?
For example, if an experiment requires Hadoop, do we need to distribute Hadoop with the experiment?
By encapsulating an execution model along with the code, the utility of containers can be extended beyond reproducibility. In this talk we present a framework that can be used to deploy temporary and permanent cluster software environments within containers. This approach improves portability and enables dynamic features such as scaling and spanning, transparently to the application.
Through nested virtualisation of these environments, we can also move a step closer toward overcoming the technical constraints of facilitating resource sharing at scale – whilst satisfying the needs of every user community.
Multiphysics, multiscale scientific simulations often combine software components from multiple sources which are then “glued” together. An example of this is the HadGEM3 coupled ocean-atmosphere model developed by the Met Office Hadley Centre, which combines atmosphere, land-use, ocean, and sea-ice models with I/O and coupling libraries. The Met Office develops some components internally and others with partner institutions. Different components have different development practices and software lifecycles. My talk will cover work by RSEs at the Met Office, to ensure the combined parts work together correctly and efficiently.
It is important to check that the complete system produces the expected results in test scenarios. Most systems are continually under development, and need processes in place to check that new developments do not break existing functionality. I will present an overview of creating tests for a multicomponent physical simulation. I will discuss the technical and organisational challenges encountered in developing model restartability tests. I will describe how the tests complement existing component-specific tests and improve the technical infrastructure of the Met Office.
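As an illustration of what a restartability test checks, here is a minimal sketch. It assumes a hypothetical toy model (`step`, `run`) and a JSON checkpoint; it is not the Met Office test infrastructure. The idea is that running N steps in one go should give exactly the same state as running part of the way, writing a restart file, reloading it and completing the run.

```python
import json


def step(state):
    """Advance a toy two-variable model by one time step (hypothetical dynamics)."""
    x, y = state["x"], state["y"]
    return {"x": x + 0.1 * y, "y": y - 0.1 * x, "t": state["t"] + 1}


def run(state, n_steps):
    """Run the model for n_steps and return the final state."""
    for _ in range(n_steps):
        state = step(state)
    return state


def save_checkpoint(state, digits=None):
    """Serialise the state. If digits is given, values are rounded first,
    which mimics a restart file that does not hold full precision."""
    if digits is not None:
        state = {key: round(value, digits) for key, value in state.items()}
    return json.dumps(state)


def load_checkpoint(text):
    """Restore a state written by save_checkpoint."""
    return json.loads(text)


def restart_test(initial, n_steps, split, digits=None):
    """Return True if a split-and-restarted run matches a straight run exactly."""
    straight = run(initial, n_steps)
    checkpoint = save_checkpoint(run(initial, split), digits)
    restarted = run(load_checkpoint(checkpoint), n_steps - split)
    return straight == restarted


initial = {"x": 1.0, "y": 0.0, "t": 0}
print("full-precision restart reproduces run:", restart_test(initial, 10, 5))
print("lossy restart reproduces run:", restart_test(initial, 10, 5, digits=3))
```

The second call shows the kind of defect such a test is designed to catch: a restart file that drops precision, or omits part of the state, produces a run that silently diverges from the uninterrupted one.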
The Computational Science Centre for Research Communities (CoSeC) aims to enrich computational science and engineering research by enabling research communities to advance their work and exploit the full spectrum of local and national computing facilities. It ensures the continued development and long-term maintenance of software which makes optimum use of the whole range of hardware available to the scientific community, from the desktop to the most powerful national supercomputing facilities.
This talk will refer specifically to the CoSeC activities in the areas of interest to the UK Engineering and Physical Sciences Research Council (EPSRC). The EPSRC funds CoSeC activities through a Service Level Agreement (SLA) with STFC, which delivers work undertaken by staff at its Daresbury and Rutherford Appleton Laboratories. This work has three main components:
Support for the EPSRC Collaborative Computational Projects (CCPs) by developing, maintaining and providing expertise, software and training for a large suite of codes on a range of hardware platforms. These scientific and technical efforts are complemented by the coordination of networking events and knowledge exchange for the CCP communities; for example, organising workshops, conferences, newsletters, program libraries and visits from overseas scientists.
Support for the High-End Computing (HEC) consortia, funded by EPSRC, which distribute the computer resources available at the UK national supercomputing service. This work focusses on the development of new scientific functionality in highly scalable parallel applications, often developing the high performance computing (HPC) algorithms required to make the codes developed under the CCP programme suitable for deployment on the national facilities.
The Software Outlook activity, which focuses on software technologies that are vitally important to the development and optimization of world-leading scientific software. This includes evaluation of new software technologies, e.g. programming languages, libraries and techniques, that are essential for the timely and cost-effective exploitation of current and near-future systems and demonstrating how specific software technologies can be applied to existing applications.
Historically, data centres have tended to store all the metadata and data they curate on internal databases. However, the lack of public interfaces to such databases means that the dissemination of the stored data is then a relatively labour-intensive process. The British Oceanographic Data Centre has recently deployed SPARQL endpoints for two specific use-cases: one as an API for the vast majority of metadata that it holds and curates on behalf of the UK marine science community, and one as a service to help with the implementation of the Marine Strategy Framework Directive, where a metadata portal has been built on top of a SPARQL endpoint. The portal acts as a signposting service to relevant datasets for a broad set of stakeholders. A wide range of ontologies have been used to describe the metadata in the resultant triplestores, and we show how linked data has multiple benefits for data centres and end-users in general as a result of the international standards used. The approach leads to more intuitive data searching, and greater use of linked data in data exposure could transform the way data and metadata are queried.
Pulling through the benefits of meteorological research into time-critical forecasting operations is a challenging pursuit. In recent years forecast models have increasingly focussed on the entire Earth system, coupling atmospheric prediction with, for example, ocean, land-surface, sea-ice, and atmospheric chemistry models. Consequently, the software systems for running them are becoming increasingly multi-faceted and complex. In order to fully utilise supercomputer resources, software must be designed to be highly parallelised, adding further complexity. A balance is needed between efficient delivery and appropriate assurance of the robustness and quality of the forecast system. In an attempt to face these challenges the European Centre for Medium-Range Weather Forecasts (ECMWF) has recently begun to review and refine its process for delivering research to operations. This presentation will discuss some of the changes made and the resulting improvements to the process, including formalising pre-merging changes, earlier testing, standardised tests, and improved communication and engagement. It is recognised that changes in culture, not just appropriate tools and working practices, are important to ensure success.
The MRC Dementias Platform UK (DPUK) is a multi-million pound public-private partnership to accelerate progress in, and open up, dementias research. In this talk I’ll describe how neuroimaging researchers are using this program to develop a national image sharing and analysis platform for dementia research. I’ll talk about the technologies that are being used to deliver the platform and some of the technical and social challenges that we encountered in developing this platform for its intended community of researchers.
The world is producing more data every year. This allows for more detailed research – if we can utilise it. Dealing with complicated data is hard as looking at tables rarely enables good analysis of the trends behind the numbers. Many projects do this in one of two ways – a complex, customised series of visualisations, or a simple one-size-fits-all Excel-style chart.
PowerBI gives you the best of both worlds. It has a large gallery of built-in visualisations, all of which are available on GitHub and, most importantly, can be extended and edited. You can also still create completely custom solutions. It interfaces with a large variety of data storage types, including SQL servers, R scripts and Excel files. And best of all? Most features of PowerBI (enough for the majority of users) are free.
In this talk we will review the use of PowerBI as a visualisation tool, particularly looking at the creation of visuals for the AHRC funded Creative Fuse North East (CFNE) project. This project used PowerBI to design explorative visualizations for the CFNE team as they wrote an important project report. A key issue in the project has been managing the dividing line between public and private data in the survey and we will talk about how the team addressed that using PowerBI.
Python is a very flexible language. This makes it very easy for scientific developers to write code, but can make it challenging as a language for software engineering. This talk will cover some of the things that our team has done to write our project in Python, some of the issues that we have come across, and what we have done to mitigate the risks that these issues raise.
The ocean forecasting department at the Met Office has been routinely monitoring its forecast quality and comparing itself to international ocean centres for several years. Scientists are now able to understand the impact their changes have on our forecasts. Our users can easily see how we compare to our international competitors. And with our collaborators we can identify issues affecting particular forecasting systems. This ideal situation was not always the case. To get to this point an entire ecosystem needed to be crafted. Data formats needed to be agreed upon with international collaborators to maximise interoperability. Tools needed to be designed with intuitive interfaces to allow new scientists to analyse data. But most importantly modern software development processes, such as TDD, needed to be followed to ensure outputs from the system are scientifically credible.
In this work, I explore the choices, challenges and lessons learned from writing an entire software ecosystem from scratch over the course of several years and how it has slowly become a useful general purpose system for assessing ocean models.
Future High Performance Computers will require applications to run on many thousands of lower power processors and uncertain machine architectures (potentially hybrid combinations of multi-core, many-core and GPUs). This raises challenges for both legacy and new codes in the research software domain as code maintenance and development will require many skill sets encompassing domain specific science, computational science and software engineering. How then can research scientists write parallel code without having to substantially change it each time the machine architecture changes? The UK Met Office is developing a new software infrastructure which aims to address this by a ‘separation of concerns’ between scientific code, infrastructure code and parallel systems code to enable fast optimisation for different hardware architectures. Our ethos is to enable scientists to write science code without being concerned with the architecture it will run on. We use a three-layer approach termed ‘PSyKAl’ (Parallel System-Kernel-Algorithm) which is facilitated by a code autogeneration system that takes pure science code (Kernels and Algorithms) and rewrites it, applying machine-specific optimisations.
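The three-layer separation of concerns described above can be illustrated with a small sketch. This is a hand-written toy, not generated code and not the interface of the Met Office infrastructure: the function names (`kernel_axpy`, `psy_serial`, `psy_threaded`, `algorithm`) are hypothetical, and a thread pool stands in for whatever parallelisation a real parallel-systems layer would apply.

```python
from concurrent.futures import ThreadPoolExecutor


def kernel_axpy(a, x, y):
    """Kernel layer: the science for a single point, with no knowledge of
    how the points are looped over or distributed."""
    return a * x + y


def psy_serial(kernel, a, xs, ys):
    """Parallel-systems layer, variant 1: a plain sequential loop."""
    return [kernel(a, x, y) for x, y in zip(xs, ys)]


def psy_threaded(kernel, a, xs, ys, workers=4):
    """Parallel-systems layer, variant 2: the same loop handed to a thread
    pool. In a real system this layer would be generated, not hand-written."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the output matches
        # the sequential variant element for element.
        return list(pool.map(lambda xy: kernel(a, xy[0], xy[1]), zip(xs, ys)))


def algorithm(psy_layer, xs, ys):
    """Algorithm layer: states what is computed on whole fields, not how."""
    return psy_layer(kernel_axpy, 2.0, xs, ys)


xs = [float(i) for i in range(8)]
ys = [1.0] * 8
print(algorithm(psy_serial, xs, ys) == algorithm(psy_threaded, xs, ys))
```

Swapping `psy_serial` for `psy_threaded` changes how the work is executed without touching either the kernel or the algorithm, which is the property the layering is intended to provide.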
Data wrangling is the procedure of accessing, comprehending, and manipulating new datasets. It is important to perform this procedure correctly, in order to minimize confusion and ensure that data is used the way it was intended.
Data wrangling is an arduous process for research software engineers (RSEs) because, in addition to manipulating complex and messy research data, RSEs must rapidly familiarize themselves with new tools and packages, as well as the lingo which is second nature to the researchers themselves.
We propose that there are individual tasks which are common to most data wrangling processes. Examples of these tasks include parsing, securing access to data dictionaries, data integration, and entity resolution. We encounter them with every new dataset, and we respond with customized scripts tailored to meet the peculiarities of each one. These scripts are time-consuming, error-prone, difficult to reproduce, and impossible to reuse.
It is for this reason that we have secured three years’ funding to develop automation tools to address common data wrangling tasks. This talk describes the curation of datasets which will be used for testing the coverage and efficacy of our automated tools.
We present GNU Guix and the functional package management paradigm and show how it can improve reproducibility and sharing among researchers. Functional package management differs from other software management methodologies in that reproducibility is a primary goal. With GNU Guix users can freely customize their own independent software profiles, recreate workflow-specific application environments, and publish a package set to enable others to reproduce a particular workflow, without having to abandon package management or sharing. Profiles can be rolled back or upgraded at will by the user, independent from system administrator-managed packages.
We will introduce functional package management with GNU Guix, demonstrate some of the benefits it enables for research, such as reproducible software deployment, workflow-specific profiles, and user-managed environments, and share our experiences with using GNU Guix for bioinformatics research at the Max Delbrück Center. We will also compare the properties and guarantees of functional package management with the properties of other application deployment tools such as Docker or Conda.
The speaker is co-maintainer and one of the core developers of GNU Guix.
The use of computer simulations in classical molecular dynamics (i.e. solving Newton’s equations for the atoms and molecules present in a well-defined system) is today one of the main fields of interest in HPC. Real systems are made of millions of atoms, and only large clusters allow real applications to be simulated. Considering the recent impact of GPGPUs on HPC, we present the porting of DL_MESO_DPD, a meso-scale molecular dynamics simulator, to NVidia accelerators. The code has been adapted to the multi-threaded GPU architecture and the solver completely rewritten in CUDA C in order to avoid continuous exchange of data between host and device memory. Moreover, a modified cell list algorithm for the particle-particle forces has been implemented to take full advantage of SIMT parallelization, leading to an overall speedup of ~40 times on the latest NVidia GPU (Pascal) compared with the original serial version.
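For readers unfamiliar with cell lists, the sketch below shows the standard textbook form of the technique in plain Python, not the modified GPU variant the talk describes and not DL_MESO_DPD code. It assumes a cubic periodic box and a cutoff of at most half the box length; all function names are hypothetical. Particles are binned into cells at least one cutoff wide, so each particle only needs to be compared with particles in its own and the adjacent cells.

```python
import itertools
import random
from collections import defaultdict


def distance_sq(a, b, box):
    """Squared minimum-image distance between two points in a cubic periodic box."""
    total = 0.0
    for xa, xb in zip(a, b):
        d = xa - xb
        d -= box * round(d / box)  # wrap to the nearest periodic image
        total += d * d
    return total


def brute_force_pairs(positions, box, cutoff):
    """Reference O(N^2) search: every pair is tested."""
    return {
        (i, j)
        for i, j in itertools.combinations(range(len(positions)), 2)
        if distance_sq(positions[i], positions[j], box) < cutoff * cutoff
    }


def cell_list_pairs(positions, box, cutoff):
    """Find the same pairs, but only test particles in neighbouring cells."""
    n_cells = max(1, int(box // cutoff))  # cells per side, each >= cutoff wide
    cell_size = box / n_cells
    cells = defaultdict(list)
    for idx, p in enumerate(positions):
        key = tuple(int(x / cell_size) % n_cells for x in p)
        cells[key].append(idx)

    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    pairs = set()
    for key, members in cells.items():
        for off in offsets:
            neighbour = tuple((k + o) % n_cells for k, o in zip(key, off))
            # .get avoids creating empty cells while iterating over the dict
            for j in cells.get(neighbour, ()):
                for i in members:
                    if i < j and distance_sq(positions[i], positions[j], box) < cutoff * cutoff:
                        pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    random.seed(0)
    box, cutoff = 10.0, 2.5
    pts = [tuple(random.uniform(0.0, box) for _ in range(3)) for _ in range(200)]
    print(cell_list_pairs(pts, box, cutoff) == brute_force_pairs(pts, box, cutoff))
```

On a GPU, the appeal of this layout is that particles in the same cell do similar work on nearby data, which suits SIMT execution; how the talk's modified algorithm exploits this is not specified in the abstract.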
Software developers aim for their codes to be efficient with respect to execution time. Normally, little consideration is given to how double precision mathematics affects this efficiency. For many modern processors, single precision operations take roughly half the time of the same double precision operation. Additionally, storing data in single precision generally gives better cache utilization and reduces data communication costs between processes. In a few application areas, single precision can be directly substituted for double precision and the resulting output is of sufficient accuracy. However, this is not normally the case and care must be taken to use the two precisions together (mixed precision) in a manner that improves efficiency whilst producing the required level of accuracy.
We demonstrate the use of mixed precision within DL_POLY, showing that some code components can easily use single precision; some components can use single precision but code restructuring was required; and, in some components, there was a detrimental effect to the overall accuracy. We show that for large, realistic test problems, mixed precision can decrease the execution time of DL_POLY by at least 15% and maintain overall accuracy.
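The trade-off described above can be illustrated with a small, generic sketch (it is not taken from DL_POLY). Python floats are double precision, so single precision is emulated here by round-tripping values through the `struct` module; the helper names are hypothetical. The example stores data in single precision and compares a single precision accumulator with a double precision one, which is one common mixed precision pattern.

```python
import struct


def to_single(x):
    """Round a double precision float to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_single(values):
    """Single precision data, single precision accumulator."""
    total = 0.0
    for v in values:
        total = to_single(total + to_single(v))
    return total


def sum_mixed(values):
    """Single precision data, double precision accumulator (mixed precision)."""
    total = 0.0
    for v in values:
        total += to_single(v)
    return total


if __name__ == "__main__":
    values = [0.1] * 100_000
    exact = 10_000.0
    print("single accumulator error:", abs(sum_single(values) - exact))
    print("mixed accumulator error: ", abs(sum_mixed(values) - exact))
```

In this toy case the data can stay in single precision, but the running total loses accuracy unless it is kept in double precision, which mirrors the abstract's point that some components tolerate reduced precision while others do not.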
As a software developer or engineer you usually try to find the perfect tool for the task: a tool which covers all the needs you have. Therefore, if faced with several tasks to solve, you will turn to the most powerful tool for each task.
At the German Aerospace Center (DLR), a large research facility in Germany, we needed to update our basic software engineering tools (software repository (SVN), issue tracker (Mantis), and continuous integration (none)), which were used by 2,000 to 3,000 people. Instead of turning to the most powerful tools on the market, we decided to take a step back and ask the scientists what is most important to them.
In this talk we present our results, the decisions we took based on them, and how the new solution has been received by the scientists.
When it comes to software management, most of us are familiar with a range of tooling to support reproducible research. If we want to ensure we can re-run the exact code used to generate a particular research result, we use Git. If we want to share a particular version of our code with others, we use GitHub. If we want to publish our software in a way that lets others easily discover it and install any required dependencies, we upload it to one of the many package management systems available (e.g. PyPI, CRAN, npm). If we want to wrap everything up neatly into a self-contained environment others can easily deploy, we use tools like virtual machines or Docker containers.
However, when it comes to managing data, many of us are less familiar with the available tooling. In this talk I will give an overview of some of the existing tools available to help us consume and publish data in a reproducible manner. I will also discuss some areas where these data management tools fall short of their software management equivalents, and the challenges faced in bridging this gap. Finally, I will provide some suggestions for how we as a community can contribute to improving the data management situation for reproducible research.
This talk will report on research software development approaches to working on system-of-systems model integration within a multi-institution collaborative
ITRC Mistral is a project investigating long-term infrastructure systems planning and assessment, with sector models focussed on energy, water, waste, transport and communications infrastructures aiming to ask questions about capacity, risk and performance under long-term scenarios of socio-economic, technological and environmental change.
My focus, along with colleagues in Oxford, is to enable the integration of the models by developing tools and supporting collaboration. In order to collaborate on model development, we’ve set up version control and reproducible virtual machine environments to run prototype models; we’ve also attempted to approach the shared challenges of integration and model development in an iterative way, starting small, introducing unit testing and integration/sense checking early in the process; and we’ve collaborated in providing some development best-practices training, running a couple of design/hackathon workshops and setting up communications channels for help, chat and catching-up.
One of the problems with the lack of an RSE career path is that it is difficult to gather demographics on Research Software Engineers. It is difficult to campaign for the RSE community if we know little about it. In this paper, we investigate two methods to collect information about this invaluable community.
In January 2016 we ran the first survey of Research Software Engineers, which presented information on the RSE community, including the tools they used, their happiness in their current jobs, their salaries and the gender split. In April 2017, we repeated this survey, and in this paper we will present the results of our analysis to show a snapshot of the community and compare it to the previous year.
The survey is limited to people who identify as Research Software Engineers, so we conducted a study of jobs.ac.uk to find how many software jobs exist in academia. Early results indicate that around 7% of them relate to a position that involves software development. If we extrapolate this over the entire UK research community, there could be as many as 14,000 positions related to software development. We will present our analysis and a comparison between employment conditions for software developers and researchers in academia.
We often talk about research software as distinct from software more generally, because there is anecdotal evidence that there are differences. These include: different incentives for developing the software, different skills backgrounds of the developers, and different funding models.
In this work, I present the results of analysis of software where the code is made publicly available on GitHub. I compare attributes such as contributor community size and health, code metrics, documentation quality, and lifecycle analysis to identify if these perceived differences in the development of research software translate into actual differences that can be identified in the source code repositories. Finally, I suggest areas where research software could be improved based on this analysis.
For over 4 years I have been using Puppet and LXC containers to automate the deployment of the numerous, highly customized Ubuntu desktops I use for my work and to manage systems on my small servers. People usually use configuration management tools like Puppet when they manage clusters of thousands of servers. I want to present a way of using Puppet when, like me, you manage only a few of them. Learning Puppet is productive because it is a mature language with libraries that abstract away the idiosyncrasies of managing so many components of a Linux system.
So whenever your Ubuntu upgrade doesn’t go smoothly, you don’t have to fear wasting days setting up everything *again*. Or maybe you want to set up a real cluster farm that does HPC on the newest GTX 1080?
Jan Philipp Dietrich (presenter), Lavinia Baumstark, Anastasis Giannousakis, Benjamin Leon Bodirsky, David Klein – Potsdam Institute for Climate Impact Research (PIK), Germany
Bringing structure into data processing workflows in R
Have you ever had to implement a new feature in a collection of data preprocessing scripts all written by different people with different programming styles? In our case a fixed, spatial aggregation scheme had to be replaced by a flexible, user-defined one.
We tackled this issue by creating a more structured workflow with distinct processing steps and clearly defined interfaces, and bundled it as an open source package. All existing scripts were transferred to this new structure.
The toolkit provides wrappers for downloading, importing, and converting data sources and for subsequent calculations. The wrappers apply checks on the data, create and manage metadata, handle data caching and provide spatial data aggregation.
Besides flexible spatial aggregation, this structure has several other positive effects. It improves transparency and reproducibility in data processing, as code is now stored centrally (in a package) and follows a predefined structure. It enhances quality management through automated tests after each data processing step and semi-automated metadata creation and management. It simplifies the update, replacement and addition of source data and calculations thanks to its well-defined data interfaces.
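To make the wrapper idea concrete, here is a minimal sketch in Python (the package itself is written in R, and none of its actual functions are reproduced here). The names `run_step`, `aggregate` and `calc_population`, the specific check, and the sample numbers are all hypothetical. The wrapper runs a calculation once, checks and caches its output, and then aggregates it with whatever user-defined regional mapping is supplied.

```python
def aggregate(values, mapping):
    """Sum per-unit values into regions given a user-defined unit -> region mapping."""
    totals = {}
    for unit, value in values.items():
        region = mapping[unit]  # raises KeyError if the mapping misses a unit
        totals[region] = totals.get(region, 0.0) + value
    return totals


def run_step(calc, mapping, cache):
    """Wrapper around one processing step: cache, check, then aggregate."""
    name = calc.__name__
    if name not in cache:
        data = calc()
        # Illustrative check only: every value must be numeric.
        bad = [k for k, v in data.items() if not isinstance(v, (int, float))]
        if bad:
            raise ValueError(f"{name}: non-numeric values for {bad}")
        cache[name] = data
    return aggregate(cache[name], mapping)


def calc_population():
    """Hypothetical calculation returning made-up values per country code."""
    return {"DEU": 3.0, "FRA": 2.0, "USA": 5.0}


if __name__ == "__main__":
    cache = {}
    two_regions = {"DEU": "EUR", "FRA": "EUR", "USA": "USA"}
    one_region = {"DEU": "GLO", "FRA": "GLO", "USA": "GLO"}
    print(run_step(calc_population, two_regions, cache))
    print(run_step(calc_population, one_region, cache))  # reuses cached data
```

Because the aggregation scheme is an argument rather than being hard-coded in each script, switching to a different regional mapping needs no change to the calculation itself.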
Containers, such as Docker, are useful for providing a consistent environment to compile your code in. I will present some of the ways we are using Docker in the FEniCS project (www.fenicsproject.org) to compile our source code and run unit tests on developer branches before integration into the master branch.
First, we need to build a suitable container, containing all the required and optional dependencies for building our software. The description is kept in a “Dockerfile” which is built automatically online using the quay.io service.
Our main code repository is on bitbucket.org – we can build the project and run tests in our Docker container using the “Bitbucket Pipelines” facility, which is hosted by Bitbucket. We have been testing other products, such as CircleCI and Bamboo (which we host ourselves). With Bamboo (similar to Jenkins), it is possible to have a two-stage process, where each branch of our library is built into a container, which is then pushed back to a registry. Subsequently, the unit tests can be run in parallel by pulling this new image and running it on several hosts.
Automated branch testing has improved our productivity, and allowed bugs to be picked up earlier, before they get into the master branch.
At some point in your career, you believe you have finally found your peers. The feeling of finally being “home” can quickly wash away and be replaced with the despair of questioning whether you can even remember how to tie your shoes! Working at the higher end of your field, as you progress, you inevitably go from being the smartest person in the room to just a person in the room. This leads many to question whether they deserve to be in that room at all. This talk aims to explore the feeling of being an imposter and the feeling that you may be out of your depth!