
Hands-on workshops

 

Scalable High Performance Computing in the cloud
Ilias Katsardis

Gain on-demand elasticity and consistent performance by extending your High Performance Computing (HPC) workloads to Google Cloud. In this session, learn how HPC clusters can be created on Google Cloud Platform by utilizing Google Compute Engine VMs and Google Cloud Storage. Running HPC workloads in Google Cloud enables you to augment on-premises HPC clusters, or run all of your HPC in the cloud. We’ve also announced collaborations with several HPC solution providers (such as SchedMD’s Slurm, Altair’s PBS and Adaptive Computing’s Moab) to help meet the increasing HPC and data analytics needs of your applications. After attending this session, you’ll have a comprehensive understanding of Google’s underlying network architecture, flexible pricing, elastic resources such as the latest GPUs and TPUs, and storage options to extend your HPC capabilities.

 

Parallelising your MATLAB Analytics
Rory Adams https://uk.mathworks.com/

You’ve developed a great piece of analysis, but how do you now make it run faster or ensure that it scales with growing numbers of scenarios and data volumes? The solution is typically parallelisation.

In this workshop you will be introduced to the parallelisation constructs available in MATLAB. You will progress from basics, such as parallelising a for-loop, to more advanced practices, such as asynchronous execution, communicating between processes to report intermediate results, and minimising data transfer overheads. You will also discover how easy it is to scale by leveraging clusters, whether local or cloud-based.

Recommended prerequisites
Introductory knowledge of MATLAB

 

Machine Learning at Scale on AWS with Amazon SageMaker & Open Data
Brendan Bouffler

The development and application of machine learning models is a vital part of scientific and technical computing. Increasing model training data size generally improves model prediction and performance, but deploying models at scale is a challenge.

Participants in this workshop will learn to use Amazon SageMaker, an AWS service that simplifies the machine learning process and enables training on cloud-stored datasets at any scale. With Amazon SageMaker, users take their code and analysis to the data using familiar data science tools (Jupyter notebooks), learning frameworks (MXNet and TensorFlow), and easy-to-use SDKs for Python and Spark.
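
As a flavour of this workflow, a minimal sketch using the SageMaker Python SDK might look like the following; the entry-point script, S3 paths and instance choices are placeholders, and parameter names vary between SDK versions.

    # Illustrative sketch only: train and deploy a TensorFlow model with SageMaker.
    # Assumes the `sagemaker` Python SDK and an AWS role with SageMaker permissions.
    import sagemaker
    from sagemaker.tensorflow import TensorFlow

    role = sagemaker.get_execution_role()      # IAM role the training job runs under

    estimator = TensorFlow(
        entry_point="train.py",                # your training script (placeholder)
        role=role,
        instance_count=1,                      # scale out by raising this
        instance_type="ml.p3.2xlarge",         # GPU training instance
        framework_version="2.4",
        py_version="py37",
    )

    # Train against data already sitting in S3 (placeholder bucket/prefix).
    estimator.fit("s3://my-bucket/training-data/")

    # Deploy the trained model behind a managed prediction endpoint.
    predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")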

Attendees will walk through the process of building a model, training it, and applying it for prediction against large open scientific datasets such as satellite imagery and 1000 Genomes data. By the end of the session, attendees will have the resources and experience to start using Amazon SageMaker and related AWS services to accelerate their scientific research and time to discovery.

Recommended prerequisites
Familiarity with ML principles and experience using Jupyter notebooks

 

Getting more Python Performance with Intel® optimized Distribution for Python
Jim Cownie

This workshop will introduce the background of the Intel distribution and explain why it is faster than a plain vanilla Python distribution running on Intel architectures, and will use examples to guide the audience in applying Python applications and libraries most efficiently, resulting in fast applications.

Initially, the audience will learn the basic steps to install and run an Intel-optimized Python distribution. The audience will then be guided through techniques for getting the best performance out of Python, using Intel libraries (e.g. pyDaal) and examples from the classical machine learning field to gain deeper insights into performance optimization.
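
As a rough illustration of the kind of comparison involved, the snippet below times a NumPy matrix multiplication, which the Intel distribution dispatches to its MKL-backed BLAS; running it under a stock Python installation and under the Intel distribution is a simple way to see the difference.

    # Illustrative micro-benchmark: run this under a stock Python install and
    # under the Intel Distribution for Python (where NumPy links against MKL).
    import time
    import numpy as np

    n = 2000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    start = time.time()
    c = a @ b                      # dense matrix multiply, delegated to the BLAS library
    elapsed = time.time() - start

    print(f"{n}x{n} matmul took {elapsed:.3f} s")
    np.show_config()               # shows which BLAS/LAPACK this installation uses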

Recommended prerequisites
Basic knowledge of programming and, ideally, of the Python language

 

Lean Tools for Product Development
Ian Mulvany

Software can quickly take on a life of its own: new feature requests, thoughts of refactoring, bugs to be squashed. Given that the only bug-free code is no code, as soon as you start committing you are committing to one path over another.

How might tools from lean product development help in prioritising feature development? How might they help with thinking about sustainability models for software projects?

Lean product development is a practice that evolved from the Toyota production system and it aims to eliminate waste in systems by:

  • Efficiently managing inventory (backlog of feature requests)
  • Understanding process flow (lead time and cycle times)
  • Managing risk through reducing cycle times for experimentation

In this workshop we will look at a hierarchy of tools that can help with thinking about prioritisation questions, from project level down to specific feature level.

We will do a deep dive into a couple of these tools.

The lean value tree can help to orient broader program efforts. The lean canvas can help to identify the largest risks that a project can face.

Working with the group we will solicit some real use cases to use for the workshop, and attendees will then be encouraged to try the tools out on their own projects.

We will be using post-it notes and sharpies for the workshop, no special prior knowledge is required.

Recommended prerequisites
An interest in lean processes is helpful, but not required.

 

An introduction to Julia
Valentin Churavy, JuliaLab@CSAIL, Massachusetts Institute of Technology

Julia is a programming language that was designed to avoid the “N+1 language” problem currently plaguing the scientific community, where high-level scripting languages are preferred by domain experts for productivity, but low-level languages are required for high performance and access to hardware-specific features.

In this workshop I will introduce participants to Julia, with the goal of exploring how Julia manages to be a high-level, easy-to-program language while not leaving performance on the table. This allows for productive collaboration using a single source language.

Participants will also learn how to call other languages from Julia and how they can call Julia projects from their existing codebase.
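
For example, calling into Julia from an existing Python codebase can be done with the pyjulia package; the sketch below is illustrative and assumes a working Julia installation with PyCall.jl set up.

    # Illustrative sketch: calling Julia from Python via the `julia` (pyjulia) package.
    # Assumes Julia is installed and PyCall.jl has been configured.
    from julia import Main

    # Evaluate Julia code directly...
    Main.eval("f(x) = 2x + 1")

    # ...and call the resulting Julia function from Python.
    print(Main.f(20))              # -> 41

    # Julia's standard library and packages are reachable the same way.
    print(Main.eval("sqrt(2.0)"))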

Time permitting, I will briefly explore benchmarking, performance measurement, distributed computing, and GPU programming in Julia.

Recommended prerequisites
Programming experience in another language

 

Using Singularity for High Performance Computing
Mihai Duta, Diamond Light Source
Andrew Gittings, University of Oxford

The use of containers is one of the hot topics at the moment in high throughput and high performance computing. Container technology, initially designed for micro-service virtualisation, has found an expanding role in scientific computing for at least two reasons:

  1. containers allow relatively easy management of software processing pipelines (often complex and with restrictive software dependencies) in a way that is independent of the computing platforms on which they run, and
  2. as a consequence of the above, containers facilitate the reproducibility of computing results.

Designed with a focus on high performance computing, Singularity is arguably the container of choice for the data centre in a scientific facility. In a one-off effort, Singularity can package a “difficult” application with complex dependencies and a “temperamental” installation, creating a portable, shareable, and re-usable job environment. With this, Singularity offers the combined advantage of reproducible runs and easy control of the software stack at a price of only a modest decrease in performance relative to “native” runs.

This workshop will give participants an understanding of Singularity’s mode of operation and of what makes it particularly suitable for reproducible research. The topics covered using hands-on examples will be:

  1. the concept of containers and an introduction to Singularity;
  2. creating and bootstrapping Singularity container images;
  3. pulling and running pre-packaged images from Singularity Hub (a brief sketch follows this list);
  4. converting Docker images to Singularity;
  5. managing data: creating, importing, exporting, mounting, etc.
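
As a taster of item 3, pulling and running a pre-packaged image boils down to a couple of commands; the sketch below scripts them from Python, assumes the singularity binary is installed on the host, and uses a well-known example image from Singularity Hub.

    # Illustrative sketch: pull an image from Singularity Hub and run a command in it.
    # Assumes `singularity` is on PATH; flag/output-name syntax varies slightly between versions.
    import subprocess

    # Pull a pre-built example image from Singularity Hub (shub://) into a local file.
    subprocess.run(["singularity", "pull", "--name", "hello-world.simg",
                    "shub://vsoch/hello-world"], check=True)

    # Execute a command inside the downloaded container image.
    subprocess.run(["singularity", "exec", "hello-world.simg",
                    "cat", "/etc/os-release"], check=True)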

To help participants understand the function of Singularity in scientific computing, the following two topics will also be discussed:

  1. running MPI-distributed applications and enabling GPUs;
  2. Singularity integration with cluster resource managers.

Recommended prerequisites
Intermediate experience of using Linux.

 

Data visualisation with Shiny
David Mawdsley, University of Manchester
Louise Lever, University of Manchester

R has excellent and extensive data visualisation facilities. Shiny lets us make these interactive by creating web-based visualisations within R. These can be deployed using Shiny Server, or used locally within R. Shiny is useful for exploratory work, prototyping, and production use.

In this workshop we will show users how to create interactive visualisations of tabular data. Inspired by Hans Rosling’s excellent data visualisations, we will use the well-known gapminder data to motivate the examples in our workshop. We will cover making graphs with ggplot2, using Shiny widgets to allow the user to customise the graph, and making interactive graphs. By the end of the workshop participants will be able to create simple, but extensible, Shiny apps.

Recommended prerequisites
A working knowledge of R (especially ggplot2) would be very useful for attendees.

 

Lightweight data management with dtool
Tjelvar Olsson, John Innes Centre
Matthew Hartley, John Innes Centre

The explosion in volumes and types of data has led to substantial challenges in data management. These challenges are often faced by front-line researchers who are already dealing with rapidly changing technologies and have limited time to devote to data management.

There are good high-level guidelines for managing and processing scientific data. However, there is a lack of simple, practical tools to implement these guidelines. This is particularly problematic in a highly distributed research environment where needs differ substantially from group to group, centralised solutions are difficult to implement, and storage technologies change rapidly.

To meet these challenges we have developed dtool, a command line tool for managing data. The tool packages data and metadata into a unified whole, which we call a dataset. The dataset provides consistency checking and the ability to access metadata for both the whole dataset and individual files. The tool can store these datasets on several different storage systems, including traditional filesystem, object store (S3 and Azure) and iRODS. It includes an application programming interface that can be used to incorporate it into existing pipelines and workflows.

The tool has provided substantial process, cost, and peace-of-mind benefits to our data management practices and we want to share these benefits. The tool is open source and available freely online at http://dtool.readthedocs.io.

Recommended prerequisites
Basic understanding of the Linux command line.

 

The Hitchhiker’s Guide to Parallelism with Python
Declan Valters, University of Edinburgh

Classical thinking on Python is that it does not support parallel programming very well, and that ‘proper’ parallel programming is best left to the ‘heavy-duty’ languages such as Fortran or C++.

This workshop will challenge that assumption through a hands-on tour of the options that are available for parallel programming with Python. It will start with a brief introduction to the issues that have given rise to this traditional thinking, namely the Global Interpreter Lock (GIL), and why this currently prevents traditional thread-based parallelism in Python.

The workshop will then cover, in a series of hands-on exercises, some of the alternative approaches to parallelising problems using Python libraries, including:

  1. The ‘multiprocessing’ module for simple process-based or task parallelism (a brief sketch follows this list);
  2. Auto-parallelisation of numeric codes with the ‘numba’ library;
  3. Message Passing Interface (MPI) approaches to parallelism with the ‘mpi4py’ library;
  4. A look at how to circumvent the GIL issue with the Cython library and OpenMP.
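
As a brief taster of the first approach, the sketch below uses the standard-library ‘multiprocessing’ module to spread a loop of independent work items across worker processes, side-stepping the GIL; the simulate function is just a stand-in for real work.

    # Minimal sketch of process-based parallelism with the standard library.
    # Each call to `simulate` runs in a separate process, so the GIL is not a bottleneck.
    from multiprocessing import Pool

    def simulate(x):
        """Stand-in for an expensive, independent piece of work."""
        return x * x

    if __name__ == "__main__":
        with Pool(processes=4) as pool:        # 4 worker processes
            results = pool.map(simulate, range(100))
        print(results[:5])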

Finally, we will compare and contrast the various Python parallelisation approaches covered in the workshop, with a discussion of how to choose the right approach for a particular software problem, and when it might be appropriate to look beyond Python for parallelism if necessary.

It is hoped that the workshop will give attendees a broad knowledge-base of Python parallelisation approaches which they can take home and apply to their own research and scientific Python codes. Attendees will be encouraged to share their own insights in the discussion session at the end.

Recommended prerequisites
A basic knowledge of Python is required; some understanding of the basic concepts of parallelism would be useful, but not essential.

 

Make testing easy with pytest
Matt Williams, University of Bristol

In recent years, pytest has become known as the foremost testing package for Python. This session will cover some of the advanced features of the tool to allow more reliable, extensive testing.

It will cover the built-in pytest marks such as xfail, skip and skipif to control which tests are run and where. Pytest’s custom fixtures are also a powerful tool to set up and tear down your test environment without too much boilerplate, so they will be introduced and explained. Mocking and monkeypatching have become a popular way to isolate parts of your system while writing integration tests, and pytest provides a built-in solution to make life easier; both ‘when’ and ‘how’ will be covered. Finally, I will introduce Hypothesis as a tool for automatically generating test cases and finding minimal reproducible cases.
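
As a small, illustrative taste of these features (not the workshop’s own examples), the snippet below combines a skipif mark, a custom fixture and monkeypatching:

    # Illustrative pytest example: a skipif mark, a custom fixture and monkeypatching.
    import os
    import sys
    import pytest

    @pytest.fixture
    def tmp_config(tmp_path):
        """Set up a throwaway config file and hand its path to the test."""
        cfg = tmp_path / "settings.ini"
        cfg.write_text("[run]\nthreads = 2\n")
        return cfg

    @pytest.mark.skipif(sys.platform == "win32", reason="POSIX-only behaviour")
    def test_config_is_readable(tmp_config):
        assert tmp_config.read_text().startswith("[run]")

    def test_home_lookup(monkeypatch):
        # Replace the environment seen by the code under test, without touching the real one.
        monkeypatch.setenv("HOME", "/tmp/fake-home")
        assert os.environ["HOME"] == "/tmp/fake-home"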

Recommended prerequisites
Must have an understanding of the Python programming language.

 

A tried-and-tested workflow for software quality assurance
Mark Woodbridge, Research Computing Service, Imperial College London
Mayeul d’Avezac, Research Computing Service, Imperial College London

There is an ever-growing number of tools available that aim to encourage best practice in software engineering and thereby improve software quality. How do you choose a combination of these tools and automate their operation such that they work for you, rather than vice versa? In this hands-on workshop we’ll guide attendees through the process of building up a project from a single Python file to a fully-working quality assurance (QA) setup including static code analysis (code formatting, syntactic verification, type checking, etc.), multi-environment regression and unit testing, and performance analysis. This verification process will be fully automated using Git and a continuous integration (CI) server. Our selection of open source tools is informed by experience of collaborating on large-scale software projects with researchers at Imperial College London. We’ll be focusing on Python but will also discuss how similar approaches to QA can be used with other languages, and in HPC and cloud environments.
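
To make the starting point concrete, a hypothetical single Python file of the kind a QA setup grows around might look like this; the type hints give a checker such as mypy something to verify, and the test function gives pytest something to run:

    # stats.py - hypothetical starting point for building up a QA setup.
    # Static analysers (e.g. mypy, flake8) check the annotations and style;
    # the test below runs under pytest.
    from typing import Sequence

    def mean(values: Sequence[float]) -> float:
        """Return the arithmetic mean of a non-empty sequence of numbers."""
        if not values:
            raise ValueError("mean() requires at least one value")
        return sum(values) / len(values)

    def test_mean() -> None:
        assert mean([1.0, 2.0, 3.0]) == 2.0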

Recommended prerequisites
Basic working knowledge of Python and Git.

 

Building and Deploying Custom JupyterHub Images using Docker and Kubernetes to run Workshops in the Cloud
Christopher Woods, University of Bristol
Kenji Takeda, Microsoft Research
Lester Hedges, University of Bristol
Antonia Mey, University of Edinburgh

Jupyter provides an excellent interactive environment for running workshops. While there are many free services that let you explore Jupyter, you will need to run your own JupyterHub server if you want to use a custom image that includes your own software, if you want more cores than are provided by the free service, or if you want to run a workshop with a large number of attendees. Building and deploying your own JupyterHub using Docker, Kubernetes and the Cloud is very easy, and this workshop will show you how. You will build your own Docker image, create your own Cloud Kubernetes cluster, and will then deploy JupyterHub to this cluster using Helm. We will also provide tips and tricks we’ve learned from running Jupyter workshops ourselves. So, in short, this is a workshop in which you will learn how to build and run your own workshop 🙂

Recommended prerequisites
You should be comfortable using the Linux command line, should have a very basic understanding of what Docker (or containers) are, and have some knowledge of what Jupyter is (we will provide background reading on Docker and Jupyter, and will teach you about Kubernetes, JupyterHub, and deploying these to the cloud).

 

 

Discussion workshops

Inclusive RSE Hiring Practices
Matthew Johnson, Camilla Longden

Hiring good people is never easy, but hiring for RSEs can often feel twice as hard because you are evaluating two disparate skillsets at the same time! In this workshop we will share our experiences with hiring RSEs in Microsoft Research, share some of our techniques, and provide ample opportunities for discussion around the experiences that attendees have had on both ends of an interview during their RSE careers. Our particular focus will be on how we can approach hiring in our field such that it is a more inclusive process that becomes a vehicle for increased diversity of all kinds in our teams. Specific activities will include: demonstrations of different phone screen technologies, an interview question sharing and brainstorm session, a best and worst interview stories discussion, some tips on inclusive practices and interview techniques, and if time allows a few mock interviews with discussion and analysis.

Recommended prerequisites
None needed aside from a willingness to share experiences and an openness to learn from others. Those who have extensive experience interviewing (or being interviewed!) are encouraged to attend so as to share their knowledge and experience with others.

 

Using Distributed Deep Learning and Transfer Learning Techniques to Solve Real-World Problems, From the Detection of Retinal Disease to Credit Card Fraud
David Ellison
Armand Vilalta-Arias

As of today, deep learning (DL) networks are the most powerful representation learning techniques in the field of Artificial Intelligence. However, because it is computationally expensive to train these networks, it is infeasible and unnecessary to train a DL model for every specific problem one may want to solve. In that regard, Transfer Learning (TL) is the sub-field dedicated to the reuse of pre-trained neural networks to solve new problems. In this workshop, we will introduce the concepts of distributed training, DL and TL. We will then discuss the use of TL to classify five forms of retinal disease, achieving state-of-the-art accuracy while balancing time-to-train performance. After this, we will explore how to implement DL and TL using other real-world examples. While demonstrating our DL and TL solutions we will host an interactive discussion to explore different possible approaches. Finally, we will summarize our generalizable findings regarding effective TL experimental design and scalable deep learning.
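
To make the TL idea concrete, a generic sketch of the pattern in Keras (not the presenters’ retinal-disease model) takes a network pre-trained on ImageNet, freezes its convolutional layers, and trains only a small new classification head for, say, five target classes:

    # Generic transfer-learning sketch with Keras (illustrative, not the workshop's model).
    # A ResNet50 pre-trained on ImageNet provides the features; only the new head is trained.
    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Model

    base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    for layer in base.layers:
        layer.trainable = False            # freeze the pre-trained feature extractor

    x = GlobalAveragePooling2D()(base.output)
    outputs = Dense(5, activation="softmax")(x)   # e.g. five disease classes
    model = Model(inputs=base.input, outputs=outputs)

    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=5)   # supply your own data here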

Recommended prerequisites
Familiarity with programming and a basic understanding of deep learning techniques

 

Improving research workflows by enabling high-speed data transfers
Tim Chown, Jisc
Chris Walker, QMUL
David Salmon, Jisc
Duncan Rand, Jisc/Imperial College

This workshop will explore how RSEs whose work requires transfers of significant volumes of research data between organisations can learn to better articulate the requirements for such transfers, understand issues affecting data transfer performance, discover more about existing best practices, and discuss how best to engage and work with their local computing service to turn that theory into practice.

Notes from the meeting can be found on Google Drive.

The format of the workshop will be a series of brief presentations by multiple presenters to lead into a number of discussion topics.

The discussion topics include:

  • Your rationales for moving research data, how you can articulate requirements, how to get a feel for theoretical numbers (e.g. moving 1TB in 1 hour requires roughly 2Gbit/s; a worked example follows this list) and how those relate to typical campus capacity. Thinking about ‘future looks’ on requirements;
  • What factors you think might affect transfer performance; your experience with transfer tools, a demonstration of TCP throughput theory;
  • Which tools do you currently use to diagnose data transfer issues? From ping and traceroute, through MTR to perfSONAR. We plan to include some hands-on inspection of perfSONAR measurement meshes;
  • Science DMZ / RDTZ principles and best practices; campus network engineering, data transfer node (DTN) tuning, network performance measurements;
  • Engagement with local computing services; how/when you do it today; doing so in the context of improving data transfers;
  • How to pick the right data transfer tools;
  • Discussion of a materials science case study between Diamond and Southampton as an example of what can be achieved; going from using physical media to achieving 2-4 Gbit/s over the Janet network.
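
As a worked example of the ‘theoretical numbers’ point above, the back-of-the-envelope calculation below estimates the sustained rate needed to move a dataset of a given size in a given time:

    # Back-of-the-envelope estimate of the sustained rate needed for a transfer.
    def required_gbit_per_s(terabytes: float, hours: float) -> float:
        bits = terabytes * 1e12 * 8          # 1 TB = 10^12 bytes = 8x10^12 bits
        seconds = hours * 3600
        return bits / seconds / 1e9          # convert bit/s to Gbit/s

    print(required_gbit_per_s(1, 1))     # ~2.2 Gbit/s to move 1 TB in an hour
    print(required_gbit_per_s(100, 24))  # ~9.3 Gbit/s to move 100 TB overnight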

Recommended prerequisites
Attendees should have an interest in learning about how they can move large data sets more effectively between data capture, compute, storage or visualisation facilities at different university / facility sites on Janet or beyond, and how doing so might improve or allow innovation in their research workflows. No prior knowledge of networking is assumed.

 

Implicit none? Does implicit bias affect the careers of women RSEs?
Catherine Jones, STFC
Joanna Leng, University of Leeds
Kirsty Pringle, University of Leeds
Tania Allard, University of Leeds
Alys Brett, CCFE

Implicit, or unconscious, bias refers to the attitudes or stereotypes that affect our understanding, actions, and decisions in an unconscious manner. For example, subconsciously, more people associate men with technical skills and women with caring skills than vice versa. Research has shown that knowing the gender of candidates can affect the outcome of job applications negatively for women. Unconscious bias is a normal part of human psychology, but if left unchallenged it can reduce diversity in the workplace. This practical workshop aims to raise awareness of implicit bias and discusses how we can act to reduce the effect of implicit and unconscious biases within the RSE community. It is part of the community and careers theme.

The workshop will start with an introduction to the subject, before the participants split into smaller groups to do a variety of activities to demonstrate and challenge their own implicit and unconscious biases. The group will come back together at the end to share insights amongst the whole workshop.

The workshop is open to both men and women, and to people of all career stages.

Recommended prerequisites
No specific knowledge or skills are required, just an interest in gender-neutral career development.

 

Building and Running an RSE group: An Open Discussion on the Challenges
Paul Richmond, University of Sheffield
Simon Hettrick, University of Southampton

Invited panel:
Robert Haines, University of Manchester
Ciaran McCormick, Open University
Owain Huw, Cardiff University
John Owen, University of Birmingham
Steven Manos, Melbourne University
Alys Brett, Culham Centre for Fusion Energy
Carina Haupt, Software Engineering Group @ DLR
Find out more about the panel members

A large number of research institutions now have successful RSE groups. The model for building and expanding groups differs from institution to institution, often as a result of political pressures or the host environment. Within this open discussion, Sheffield RSE will start by giving a short overview of the process of building their group, with other groups given the opportunity to share their own experience. In particular, groups will have the opportunity to share the challenges they have faced and highlight ongoing operational difficulties. The session will provide an open dialogue, inviting other groups to share their experiences in order to promote administrative and procedural solutions both for existing groups and for people considering starting a group of their own. Topics such as the underwriting of staff, approaching faculties and university boards, and growth plans will be encouraged.

Recommended prerequisites
None.