Research Data Management Forum 18, Manchester

Monday 20 and Tuesday 21 November 2017. I'm at the Research Data Management Forum in Manchester. I thought I'd use this as an opportunity to try liveblogging, so during the event some notes should appear in the box below (you may have to manually refresh your browser tab periodically to get the latest version).

I've not done this before, so if the blog stops updating then it's probably because I've stopped updating it to focus on the conference instead!

This was made possible using GitHub's cool Gist tool.

This is an attempt at using a [gist](https://gist.github.com) to facilitate liveblogging in a static site. Thanks for joining me for the ride…

The [event programme is available online](http://www.dcc.ac.uk/sites/default/files/RDMF18_agenda_final.pdf). I'll be co-presenting a talk about using the figshare API with figshare's own Megan Hardeman on the Tuesday at 09.40.

Well, I’ve arrived and obtained biscuits and tea.

# Day 1

- Martin gives us the now standard housekeeping slide
- Overview of the programme (see the link above)
- I’m interested to hear about what they’ve been up to at Lancaster with their institutional RDM reporting dashboard 
- There will also be breakout groups tomorrow — I’m sure suggestions for these on the #rdmf18 hashtag will be welcome too, even if you can’t make it!

## Keynote: What are the challenges of Data Science?

*Prof Magnus Rattray, Professor of Computational & Systems Biology/Director of the Data Science Institute, University of Manchester*

### An example: Physics

- Large Synoptic Survey Telescope (LSST): 3.2 Gpixel camera → 2,000 exposures (= 20TB) per night → 10-year survey = 100PB data
- Large Hadron Collider (LHC): theoretical output of 68TB/s (!!!) → about 1.5GB/s to disk → 200PB total
- Square Kilometre Array will produce more data than can be processed today, but will be curated and analysed over years
- But this isn’t unexpected for physics: it’s being dealt with

### Another example: Geography

- Network analysis of 26m commuter journeys from 2011 census data
- Classify journeys into 9 super-groups and a total of 40 groups
- Individual journeys not interesting, but emerging patterns are
- The tricky stuff is not the machine learning or analysis, but bringing together data from different sources

### Mental health

- Use of wearable devices to track location of people with mental illnesses
- Handle missing data (e.g. due to mobile/GPS blackspots; a gap-filling sketch follows this list)
- Classify places and activities
- Overlay health status to identify patterns
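
As a flavour of that gap-handling step (a minimal sketch of my own, not the speaker's code; the column names and data are invented), pandas can interpolate positions through a blackspot while flagging which fixes are imputed:

```python
import pandas as pd
import numpy as np

# Hypothetical GPS trace: one fix per minute, with a blackspot (NaNs)
# where the device lost signal.
idx = pd.date_range("2017-11-20 09:00", periods=8, freq="min")
trace = pd.DataFrame({
    "lat": [53.480, 53.481, np.nan, np.nan, 53.484, 53.485, np.nan, 53.487],
    "lon": [-2.240, -2.241, np.nan, np.nan, -2.244, -2.245, np.nan, -2.247],
}, index=idx)

# Keep a flag so downstream analysis can treat imputed fixes differently
trace["imputed"] = trace["lat"].isna()

# Time-based interpolation assumes steady movement through the gap;
# `limit` caps how long a blackspot we are willing to fill.
trace[["lat", "lon"]] = trace[["lat", "lon"]].interpolate(method="time", limit=5)
print(trace)
```

The steady-movement assumption is exactly the kind of arbitrary modelling choice the keynote later suggests recording and varying.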

### Research is increasingly data driven

- Bottom-up modelling: based on assumptions about microscopic principles; develop simulation, run and then compare to reality; refine assumptions
- Data-driven modelling: identify measurable variables; fit a statistical model to data; make inferences and learn about system by identifying hidden variables
- Increasingly connected: mixing “mechanistic” prior knowledge into data-driven models (a toy sketch follows)
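
As a toy illustration of that mixing (my own sketch, assuming NumPy/SciPy): the functional form below is the “mechanistic” prior, while the parameter values are learned from noisy data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Mechanistic prior: we believe the system decays exponentially.
# The data-driven part is estimating amplitude a and rate k from
# noisy observations rather than deriving them from first principles.
def model(t, a, k):
    return a * np.exp(-k * t)

rng = np.random.default_rng(42)
t = np.linspace(0, 5, 50)
y = model(t, 2.0, 1.3) + rng.normal(scale=0.1, size=t.size)  # synthetic data

params, cov = curve_fit(model, t, y, p0=(1.0, 1.0))
print("estimated a, k:", params)            # should land near (2.0, 1.3)
print("std errors:", np.sqrt(np.diag(cov)))
```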

### Challenges for data science

- Scalability
- Complexity
- Cleaning messy data (missing data, noise, poor formatting, poor/absent experimental design)
- Human data (privacy, ethics)
- Accessibility/availability (openness, reproducibility; e.g. clinicians who protect “their” data to safeguard their future career)

### Example: genomics

- Massive drop in cost of genome sequencing over the last decade
- “It costs more to analyse a genome than to sequence it.” *David Haussler*
- 100,000 Genomes Project now collecting a huge number of genomes
- But once you can sequence genomes you can examine much more: transcriptomics, epigenetics, proteomics
- So we can now use this technology to investigate layer-upon-layer of different interacting systems and subsystems
- E.g. asthma
  - Good for a cohort study because a lot of people have asthma
  - Inconsistency and complexity indicate multiple (sub-)diseases
  - E.g. two different versions of the CD14 gene are associated with different risk levels in different parts of the world
  - Commonly thought to be a progression: eczema → asthma → rhinitis
  - Large-scale analysis shows this progression presents in only a small fraction of the population, i.e. the assumed progression is largely false

### Towards genomic medicine

- 100,000 Genomes Project: 30PB of data held securely, restricted access through a secure virtual desktop (“Inuvika”)
- Privacy of individuals’ genomes is important but difficult

### Next revolution: scaling down to single cells

- Existing methods effectively average over ~10k cells
- As well as looking at large populations of people, we can also go down to individual cell level
- Single-cell methods show e.g. diverse sub-populations in particular cell types
- Each cell is now a high-dimensional data point
- E.g. can trace different mutations through sub-populations of tumour cells
- Profile individual tumour cells circulating in the blood: can diagnose and design a drug regime based on a blood sample instead of an invasive biopsy
- Sophisticated modelling required to disambiguate features of interest from multiple confounding factors

### Dealing with the challenges

- Data volume: move compute to the data (e.g. cloud solutions); will analysis be reproducible in the future, or even across current platforms?
- Data analysis: scale up algorithms (e.g. deep learning, TensorFlow); use approximate methods; streaming data processing (a one-pass sketch follows this list); clever tricks to avoid computationally-intensive tasks
  - Things that used to be considered “software engineering” (e.g. object orientation, testing) are now important for everything
- Data quality: big data often not collected for a single purpose, so no experimental design
- Robust & reproducible research: record arbitrary modelling choices and vary them to test for robustness; hypothesis selection & p-hacking; keep track of all hypotheses considered (e.g. electronic lab notebook)

### Conclusions

- Research is increasingly data-driven; data science ubiquitous
- Big & complex data: people (especially statisticians and computer scientists) are already motivated to solve these
- How do we motivate people to confront the problems of messiness, human data, and openness (or lack thereof)?

# Day 2

- Aaaand we're back again for day 2: a full day of content after yesterday's afternoon session

## Case study: CRIS, Research Data & Institutional Reporting

*Becky Gordon, Lancaster University*

- Research services' view on data *about* research
- Work quite closely with library: overlap primarily centred around Pure CRIS
- Systems:
  - HR, student information, costing/pFact, finance → Pure
  - Pure → Departmental webpages, research directory, repository, data management, equipment register
- Reporting
  - Financial reports: monthly (really valued by senior academic staff) & annual
  - Organisational unit performance
  - Individual performance: promotions etc.
  - External requirements: OA, REF, HESA, Researchfish
- Current project: strategic research management tool
  - Reduce time spent manually generating reports
  - Single hub with live, up-to-date data
- Business questions - want data on:
  - Awards (number, value)
  - Applications (inc. success rates; a toy calculation is sketched below)
  - Impact (publications, OA compliance, …?)
- Process overview:
  - Define data and pull out into a data warehouse
  - Build reports on top of this (using Tableau)
  - Additional internal exception reports to track things that might go wrong
  - Data audit & cleaning
- Challenges
  - Differences in reporting criteria
  - Not enough good-quality data to work with
  - Difficult to make historical comparisons with older reports
- Next steps
  - Continue to produce manual reports & develop tool & Tableau reports in parallel
  - Agree reporting criteria with senior management
  - Ongoing data cleaning
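
For the success-rates question above, here's a toy version of the kind of report the warehouse-plus-Tableau pipeline might produce (the schema is invented for illustration; I have no sight of Lancaster's actual data model):

```python
import pandas as pd

# Invented schema: one row per grant application from the data warehouse
apps = pd.DataFrame({
    "department": ["Physics", "Physics", "History", "History", "History"],
    "value_gbp":  [250_000, 1_200_000, 40_000, 85_000, 60_000],
    "awarded":    [True, False, True, True, False],
})

apps["awarded_value"] = apps["value_gbp"].where(apps["awarded"], 0)
report = apps.groupby("department").agg(
    applications=("awarded", "size"),
    awards=("awarded", "sum"),
    awarded_value_gbp=("awarded_value", "sum"),
)
report["success_rate"] = report["awards"] / report["applications"]
print(report)
```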

## Case study: data repository APIs

*No updates from me for a while because I’m part of this talk!*

[Our slides are available on figshare](https://doi.org/10.6084/m9.figshare.5616445.v1) (of course!)
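
As a taster for anyone who missed the session: figshare's v2 API is plain REST, and reading public items needs no authentication. A minimal sketch (the article id is the one embedded in the DOI above; if the response fields have drifted since, the shape of the call still stands):

```python
import requests

BASE = "https://api.figshare.com/v2"

# 5616445 is the article id from the slides' DOI (10.6084/m9.figshare.5616445)
resp = requests.get(f"{BASE}/articles/5616445", timeout=30)
resp.raise_for_status()
article = resp.json()

print(article["title"])
print(article["doi"])
for f in article.get("files", []):
    print(f"  {f['name']} ({f['size']} bytes)")
```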

## Managing research throughout its lifecycle

*Prof Paul Jeffreys, Institute of Cancer Research*

- About the ICR
  - 8 diverse research divisions
  - Able to recharge infrastructure costs to research, so can fund development
  - Future plans: dynamic adaptive therapy
    - As an individual's cancer is treated, it mutates and evolves, so treatment has to keep changing to keep up
    - Data must be live and online
  - Big data is a key pillar in current strategic plan
- HPC infrastructure
  - 1,800 cores × 12–16 GB, designed for parallel workload
  - Dominated by next-generation sequencing (approx. 70% of usage)
  - Jisc data centre in Slough
- Architecture
  - 6PiB provisioned (expandable to at least 20PiB)
  - Two tiers: tier 1 is fast storage (2PiB); tier 2 is an object store (4PiB)
  - NAS layer on top so that storage tiers are a black box for users
- Policy-based migration from tier 1 → 2
  - Typically migrated if not used for 90 days, but other possibilities exist (a minimal policy sketch follows this list)
  - Migrated to long-term archive at some later date
  - Most files mirrored across 3 sites; smaller (<10MB) files only 2 sites
  - Object store cannot provide quotas, so charge based on actual usage
- Projects to develop 2 new components for sharing & syncing; also currently using a Dropbox Business service
- Looking for a metadata catalogue solution
  - Many solutions (e.g. iRODS, DSpace) aimed at facilities or libraries
  - Need something easy to use for scientists, and off-the-shelf (able to deliver a proof of concept in one person-month)
  - Open to suggestions!
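
The 90-day policy itself is simple enough to sketch (my illustration of the idea, not ICR's implementation; `migrate_to_object_store` is a hypothetical stand-in for whatever the NAS layer actually does):

```python
import os
import time
from pathlib import Path

NINETY_DAYS = 90 * 24 * 3600

def migrate_to_object_store(path: Path) -> None:
    """Hypothetical stand-in: the real system moves the file to tier 2
    behind the NAS layer so users never see the difference."""
    print(f"would migrate {path}")

def sweep(root: Path, max_idle: int = NINETY_DAYS) -> None:
    """Migrate any file not accessed within max_idle seconds."""
    now = time.time()
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                atime = path.stat().st_atime  # assumes atime survives mount options
            except OSError:
                continue  # vanished or unreadable; skip
            if now - atime > max_idle:
                migrate_to_object_store(path)

sweep(Path("/data/tier1"))  # tier-1 mount point is invented for the example
```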
  
## Scaling and empowering cultural change

*Shoaib Sufi, Community Lead, Software Sustainability Institute (SSI)*

- SSI: national facility since 2010 to "cultivate better, more sustainable research software to enable world-class research"
  - Software development: to build and maintain expertise in software
  - Training: essential software skills for researchers
  - Policy: campaigning for research software support and career recognition/development for research software engineers
  - Community: workshops & fellowship
  - Outreach: website, blog, social media
- [Fellowship programme](https://software.ac.uk/fellowship-programme)
  - £3000 travel/event bursary for people who want to improve research software
  - Funded by support grants from research councils
  - Turns out that "SSI Fellow" is quite a sought-after badge of recognition
  - Fellows = ambassadors
- What makes a good fellow?
  - Strong plan: novelty (for institution/domain); have the skills/experience to succeed; will make a difference
  - Content: demonstrate ability to create impact
  - Communications skills
- Typical activities
  - Workshops/conferences/training (including tailored Carpentries)
  - Promote SSI and contribute to its success
  - Contribute to SSI blog
- Some amazing lasting outcomes from the fellowship programme
  - Development of services (Melody Sandells)
  - Contribution to RSE conference & organisation (Alys Brett)
  - Library Carpentry (James Baker)
  - recipy workflow management software (Robin Wilson)
  - Open source versions of common commercial research software (Robin Grant)
  - Data science for doctors training (Steve Harris)
  - Establishing reproducible research as standard in a major research group (Stephen Eglen)
- Conclusions
  - The right people to effect change are *in the research community*
  - Need support and community
  - Cross-pollinate ideas across different domains
- [Collaborations Workshop 2018](https://www.software.ac.uk/cw18/) focus on themes of Culture Change, Productivity, Sustainability

## Lunchtime!

And now it's time for lunch, but after that there will be three parallel breakout groups:

1. Supporting resources for RDM: toolkits & workflows
2. Integrating data systems & catalogues
3. Impact & metrics: reporting & evidencing success

## Breakout group feedback


### 1. Supporting resources for RDM: toolkits & workflows

This includes some information from surveys and interviews around the Jisc research data toolkit project.

- Presenting content through journeys is a useful approach
- If available, quite a lot of people would use resources in an RDM toolkit to augment their teaching
- Preferred mechanism would be working group of HEI-based RDM professionals with Jisc support
- Interesting possible features: institutional subdomains with customisable content; CC-BY licence; funder policy summaries; regular newsletters

### 2. Integrating data systems & catalogues

- Important themes: ownership, provenance, privacy
- Audit trails important, but …

### 3. Impact & metrics: reporting & evidencing success
