Conquaire Continuous quality control for research data to ensure reproducibility

My observations and learnings from the IG Domain Repositories and IG Education sessions.

Some early-career researchers (yours truly included) were awarded EU scholarships from the RDA and, according to their individual interests, were assigned a few key sessions to monitor. My assigned interest groups were IG Domain Repositories and IG Education.

IG Domain Repositories: Community-driven research data management: towards domain protocols for research data management

My first breakout session ran from 11:30H to 13:00H and was chaired by Peter Doorn (@danskaw) and Patrick Aerts. Their objective was to introduce domain protocols for RDM, and they had three take-home messages:

  • treat data management and software sustainability on an equal footing, at least policy-wise.
  • consider and treat data and software as value objects.
  • make each stakeholder's position explicit by defining their role.

Stakeholders ranged from government funding organizations to societies and other executive organizations. They made an interesting point about why a one-size-fits-all DMP does not meet every research requirement, and instead stressed a domain-oriented approach: a single template for each domain, if you must. This specialized solution would address the different sub-disciplines, making it easier and more suitable for adoption among various research communities.

They also stressed the need to organize communities around DMP standards, laws and regulations, minimum conditions, templates and support resources. For example, the Domain Data Protocols (DDP) can be openly published with specific protocols for biology, physics, chemistry, cognitive science, etc. I learned how the DDP would make researchers' lives easier, as they can refer to the data protocols and raise quality standards. It would also reduce the administrative burden on researchers if a single approved RDM template existed for their domain.

They had identified DDP communities for nine domains and proposed that this granular approach would make it easy for funders to refer to the templates. A framework of protocols would help increase adoption through a top-down approach, in place of the RDMPs currently being generated at random. An interesting Q&A session followed, which was a learning experience in itself.

IG-ETRD Meeting: Joining efforts on delivering Research Data Handling training to wider community

The second breakout meeting was in the afternoon from 14:00H to 15:30H. I attended the "IG Education and Training on handling of research data" session, where Yuri Demchenko introduced us to the cooperating EU projects, with updates from spin-off groups. He stressed the importance of teaching skills with accreditation and certification for the RDM curriculum.

The EOSC pilot project had defined policy and governance, and its science demonstrators could prove that it works. He elaborated on why postgraduate education delivers the needed skill sets. However, training alone does not scale, and data scientists need other ways to demonstrate their work. For EOSC, WP7 has a framework to inform strategy. Researchers can draw on common data sources and also send out data to collaborate with other researchers, making it interoperable.

Next, FOSTER Plus, which supports the implementation of open science in H2020, was introduced. This 2-year project with a €900,000 budget would help train and educate researchers. Its 11 partners are building on previous work and facilitating open science training in European research. They would develop intermediate and advanced level training for RDM and Open Research Data. The Open Science toolkit also includes SWC (Software Carpentry, data mining), and I was glad to be a trained SWC instructor.

Research data must "GO FAIR", which stands for Global Open FAIR (Findable-Accessible-Interoperable-Reusable), and the need of the hour was an Internet of FAIR Data and Services (IFDS). However, this task is too large for one project and needs collaboration between projects and initiatives. They showed the EDISON services for core data experts, covering capacity building and skills management, and the slide showing how the training model works was highly interesting. When building a data science team, a data steward plays an important role. Bodies like IEEE and ACM will develop a curriculum for this.

When the topic of MOOC-based video lectures came up, it came to light that the dropout rate is very high. They had also collected data on geographic coverage to map learners. The Goettingen library offers meetups with a focus on OS topics: research data, publishing, and other activities. They ran outreach programs via grad-school mailing lists, held a monthly hack hour and discussions, and helped people work with Jupyter/IPython notebooks. They had an ODD (Open Data Day) celebration on 1 March 2017 and planned to hold more meetings.

The CODATA WG summer school on Research Data Science held "DataTrieste 2016" based on the SWC curriculum: Linux, bash, data analysis in R, etc. The goal was to train the trainers. Thereafter, the 2-minute presentations were showcased, in which Freya presented the Denver breakout group sessions for text and data mining. The University of Cambridge had the Data Champion Communities, in which 2 people supported the training of researchers and had built a community that championed data sharing. Their presentation was to be held on Thursday afternoon. There was also a BoF on RDM literacy whose goal was to create better conditions for including RDM courses.

The Q&A discussion also touched upon the fact that FAIR was a western concept and we needed to look at data from a developing-country perspective, including cross-domain issues. Sustainability of knowledge, and disseminating and sharing it bottom-up, is very important. The entire session was mentally stimulating, and during the Q&A I made some pertinent observations:

  • MOOC content is not CC licensed. Personally speaking, if learning resources are not openly licensed for reuse, then we will have to spend more resources reinventing the wheel, resulting in knowledge silos.
  • Having spoken to professors who were not keen on the additional work burden of creating content for videos, I also mentioned another downside of MOOCs: female professors feared becoming the subject of prank videos on YouTube. Both these issues kept them away from MOOCs.
  • My third point was regarding the geographic coverage statistics: these can get fuzzy when a proxy (e.g. Tor) is used to access the MOOC. User privacy is important.
  • With respect to the CODATA WG summer school on Research Data Science (DataTrieste), I was curious to know whether the summer school content was openly shared with other universities, just like the SWC program founded by Dr. Greg Wilson.