Conquaire Continuous quality control for research data to ensure reproducibility

Conquaire at ELAG 2017 in Athens

The ELAG 2017 conference was hosted by the NTUA from 5 to 9 June in sunny Athens. Our project (Conquaire) had applied to conduct a workshop at ELAG on "Automatic quality feedback for inter-disciplinary research data management", to be jointly presented by Christian and Vid.

Before the workshop we (Christian and I) attended the bootcamp "Applying JSON-LD to make linked-data-driven applications" on Tuesday. The main conference started on Wednesday, 7 June, with keynote speeches. It was fun to see Tory mingle with the attendees, who gladly took pics.

We had created a private repo so that attendees could follow the demo and understand how GitLab could be used as a research tool to maintain the quality of research data. All attendees were sent a login, and the workshop material was provided within the repo.

The workshop ran over two days, Wednesday and Thursday, with the third day reserved for feedback presentations from the attendees in the main hall. On Wednesday, we had 10 attendees from different countries and work backgrounds. Christian and I had decided that we wanted an open, un-conference, BoF-like session that kept the discussion free-flowing, with notes collected in etherpads. I kick-started the workshop by introducing ourselves and the Conquaire project, and then outlined our workshop agenda and intended outcomes:

  • Learn and gain insights on current research data management (RDM) practices.
  • Document tools and methodologies currently used within research groups.
  • Document guidelines on organizing research workflows in various interdisciplinary scientific research projects.
  • Document the impact of Libre tools on research reproducibility and workflow automation.

Then I asked each attendee to introduce themselves, as we wanted to understand them and their expectations from the workshop, and designated a maintainer for each etherpad. Christian used this opportunity to present the existing research data and publication services run by Bielefeld University Library.

After this, I gave a more detailed introduction to the Conquaire project, and the first topic was the "Tools" used by researchers. We had a lively discussion on how researchers use FOSS but don't release their own software because the legal department does not want them to lose the IP. Then we discussed the fundamental RDM challenges in interdisciplinary research projects, to understand the common file formats, research objects, and metadata ontologies used in research environments and how one could build an infrastructure to cater to this diversity. We quickly ran out of time and had to postpone two topics (data pipeline maintenance tools, and common computational services, skills and technology) to the second day of the workshop.

On Thursday, the second day of the workshop, I started off by resuming where we had left off the previous day, viz. data pipeline maintenance tools, and common computational services, skills and technology. Then, Christian gave a demo of the GitLab infrastructure that could be used for automation and software maintenance. He spoke about how GitLab can not only track changes to research data, but can also run automated tests that check data integrity as the data arrives, or make sure that analysis scripts still produce the same results. This can be done using GitLab CI or other continuous integration tools, and he presented a minimal .gitlab-ci.yml file to get people started. Some participants had not used Git before, but they were impressed with the hands-on demo.
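To make the idea of an automated data integrity check more concrete, here is a minimal sketch in Python of the kind of script a CI job could run on every push. It is not the script shown in the demo: the column names, the default of a file path passed on the command line, and the function names are all assumptions made for illustration. The script reports every row with a missing or non-numeric value and signals failure through a non-zero exit status, which is what a CI runner looks at.

```python
import csv
import sys
from pathlib import Path

# Hypothetical schema: these column names are illustrative assumptions,
# not taken from the workshop material.
REQUIRED_COLUMNS = ("sample_id", "temperature_c")
NUMERIC_COLUMNS = ("temperature_c",)


def validate_rows(rows, required=REQUIRED_COLUMNS, numeric=NUMERIC_COLUMNS):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for number, row in enumerate(rows, start=1):
        # Every required column must be present and non-empty.
        for column in required:
            value = row.get(column)
            if value is None or value.strip() == "":
                problems.append(f"row {number}: missing value for '{column}'")
        # Non-empty values in numeric columns must parse as numbers.
        for column in numeric:
            raw = (row.get(column) or "").strip()
            if raw:
                try:
                    float(raw)
                except ValueError:
                    problems.append(
                        f"row {number}: '{column}' is not a number: {raw!r}"
                    )
    return problems


def check_file(path):
    """Read a CSV file with a header row and return the problems found."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return validate_rows(csv.DictReader(handle))


def main(path):
    """Print all problems and return an exit status (0 means the data passed)."""
    found = check_file(path)
    for problem in found:
        print(problem)
    return 1 if found else 0


# Only run the check when a file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

A CI job would simply call the script with the path of the data file to check; any reported problem makes the job fail, so broken data is noticed as soon as it is committed.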

After this, we continued the workshop discussion with the impact of dependency hell (constant changes in technology) on reproducibility. Most participants found this an interesting aspect, as they had not considered it a serious issue. We discussed reproducibility vis-a-vis "research freedom", storage issues, and which library services (e.g. Beaker, Jupyter notebooks, etc.) could be integrated to manage research data.
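One small, practical defence against dependency hell is to record the exact software environment alongside the results. The following Python sketch is an illustration under stated assumptions, not something presented at the workshop: the function names and the snapshot layout are hypothetical. It collects the interpreter version and the versions of all installed packages using the standard library, and compares two such snapshots to show what changed between runs.

```python
import platform
from importlib import metadata


def snapshot_environment():
    """Collect the Python version and the versions of installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installations that lack a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def changed_packages(old, new):
    """Return {name: (old_version, new_version)} for every difference.

    A version of None means the package is absent from that snapshot.
    """
    names = set(old["packages"]) | set(new["packages"])
    return {
        name: (old["packages"].get(name), new["packages"].get(name))
        for name in sorted(names)
        if old["packages"].get(name) != new["packages"].get(name)
    }
```

The snapshot is a plain dictionary, so it can be written to a JSON file next to the analysis output and committed with it. When a result can no longer be reproduced, comparing the stored snapshot with the current one narrows down which dependency moved.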

Participants seemed to enjoy the interleaving of slide presentations with feedback rounds and practical demonstrations. On Friday, Alain Borel from the Swiss Federal Institute of Technology, Lausanne, presented his feedback in the main hall. Another researcher, Vasiliki Kokkala, e-mailed their feedback and allowed us to reproduce it here:

It was a pleasure to attend this workshop. I am a postgraduate student and have not worked with the management of research data, but it is an area that interests me. So for me the workshop was useful as an analytic introduction to the subject, and I liked the way the subject matters were constructed in the process. It was also of great profit for me to listen to the other participants' experience of managing research data in an interdisciplinary context. It helped me to get a concrete idea of the practices and the problems that take place in the various institutions.

TIB conference on Software and Services for Science at Hannover!

The 2nd Conference on Non-Textual Information Software and Services for Science (S3) was held on 10 and 11 May 2017 in Hannover. The event was hosted by the Leibnizhaus, the home of Gottfried Wilhelm Leibniz (1646-1716), a scholar and polymath from Europe. For a change, Christian and Vid were attending not as presenters but as attendees, and were treated to an opening welcome address by Barbara Hartung from the Lower Saxony Ministry of Science and Culture, Germany. Then, Irina Sens (Interim Director of the Technische Informationsbibliothek [TIB]) and Wolfgang Nejdl (Founding Director and Head of the L3S Research Center / University of Hanover, Germany) addressed the attendees.

The keynote by Edzer Pebesma from the University of Münster, Germany, addressed the topic of "Incentives and rewards in scientific software communities". He touched upon the fact that the most cited papers do not describe discoveries or scientific breakthroughs, and introduced the O2R project, which uses Docker containers to encapsulate data for reuse and sharing.

We had met Konrad U. Förstner in Berlin, and it was nice to meet him again and hear him elaborate on "What is good scientific practice for research software?". He spoke about the strong and growing dependency of research on software. This requires that we ensure quality, accessibility and citability, but a lack of awareness and skills, too little time, and missing incentives due to the absence of dedicated long-term funding mean that good scientific practices for research software are not adhered to. As before, the SWC was a shining example, as was the SSI.

Then, Nikolaus Forgó spoke about the "Legal requirements for software sharing and collaboration", the Copyright Act and other licensing issues. This was an educational session, as technologists shy away from legalese, but it helps to learn about the complexity of the legal maze. During the lunch break, I had the rare opportunity to pick Benjamin Ragan-Kelley's brains (@minrk on GitHub) about the cool work he does on Jupyter.

We also had talks on "Managing research software from the perspective of a scientific infrastructure provider" and the DANS perspective on Solid scenario’s for sustainable software, which touched upon the stakeholders in research projects. The third session, on persistent software referencing, had Daniel speaking about Software citation in software-enabled research, stressing that citation is not a one-time effort but has to be maintained. He asked people to cite the software itself, not just the paper.

The day ended with other sessions on Workflows for assigning and tracking DOIs for scientific software by Martin and "Software as a first-class citizen in web archives" by Helge Holzmann from the L3S Research Center.

The second day was a half-day session starting with "The collaborative creation of an open software platform for researchers addressing Europe's societal challenges", a session that elaborated on the Data Value Chain Evolution and the Semantic data lake.

Neil Chue Hong from SSI (UK) spoke about Software sustainability and had some important guidelines for the selfish scientist. Then Thomas explained how researchers can tidy up the jungle of mathematical models to create sustainable research software. Benjamin gave an introduction to Jupyter and IPython, which can be used as research tools to facilitate open access and reproducible research. The last session was on the much-awaited "Blockchain" and looked at new cryptographic methods. The half-day session ended at 12:30H with the talk on "Dsensor.org peer to peer science" by James Littlejohn from Edinburgh Napier University, United Kingdom.

Cataluña Conquaire!

Jochen Schirrwagen and Vid presented Conquaire at the ninth RDA plenary in sunny Barcelona last Friday, but first, here is a brief recap from day one!

A pre-conf welcome session for RDA Newcomers was scheduled the evening before the plenary, followed by a networking cocktail for newcomers, EU grant holders and industry participants to mingle.

On day one, 5th April 2017, the RDA 9th plenary meeting was opened at 9 AM with keynotes and introductions by the RDA Chair, the RDA Plenary Coordinator and the outgoing Secretary General. After the coffee break, the WG/IG/BoF working meetings started. The programme page lists an array of interesting discussions related to research infrastructure and community activities that aim to reduce the technical barriers to data sharing and management. Each session ran for 90 minutes, but with such an interesting programme it was hard to choose a single session from the many parallel ones.

The IG Domain Repositories session "Introduction to domain protocols for RDM" by Peter Doorn and Patrick Aerts discussed Data Management Plans (DMPs), which have been made mandatory by funders and are now considered a research tool for establishing research data management policies in research labs. The session discussed the importance of a domain-oriented approach for DMPs and the need to organize communities for standards.

After lunch, the next session I attended was the "IG Education and Training on handling of research data" meeting, where Yuri gave an introduction on establishing cooperation between projects in the EU and the EOSC pilot, the FOSTER plus project and the implementation of open science in H2020. They also spoke about the CODATA WG summer school on Research Data Science, aka DataTrieste 2016.

After the coffee break, I attended the "IG Data Versioning" session, which was about scientific reproducibility, versioning procedures and established best practices for scientific software in order to enable reproducibility of scientific results.

On day two, 6th April 2017, the RDA plenary meetings began with a "Women in RDA" breakfast from 08:00 to 09:30 AM, and it was fun to meet and chat with women from different countries. Thereafter, I attended the "IG Active Data Management Plans" meeting that sought to define, develop and test machine-actionable DMPs. They presented a summary of their workshops and conclusions from CERN, IDCC and other relevant workshops. The presentations outlined specifications for stakeholders, the scope of working groups and next steps.

Next was the "IG Long tail of research data", a session on "Priorities and challenges for managing the long tail of research data". The speakers described how organizations can identify priorities and how the community must create standards for the RDM of long tail data.

After lunch came the Data Champion Communities meeting on "Towards effective research data training delivery: creating a researcher-led data champion community", where attendees were introduced to the Data Champions project at Cambridge University. The discussion touched upon similar initiatives at other institutions and the actions required for an RDA group with outcomes. The second day ended with the Array Database Assessment WG presenting their recommendations and outputs.

On Friday, 7th April 2017, the final day of the RDA plenary, my first session was on the working group Data Description Registry Interoperability (DDRI), with a demo of Neo4j that fetched the ORCID iD, researcher data, etc.

From 11:00 AM to 12:30 PM, it was time for the presentations of the IG Repository Platforms for Research Data, where I presented "Conquaire: enabling Git based research data quality control for institutional repositories". The talk was very well received, and the Q&A session had interesting questions. I also heard other presenters speak about research data repositories in their institutions, and it was a good learning experience to understand how they selected, implemented, and used specific research data repository platforms and products in various domains.

At 12:30 PM the closing plenary session started with an announcement that the 10th plenary, with the theme "Better Data, Better Decisions (BD2)", would be held from 19 to 21 September 2017 in Montréal, Canada, hosted by the University of Montréal and Research Data Canada.

My observations and learnings from the IG Domain Repositories and IG Education sessions.

Some early-career researchers (yours truly included) were awarded EU scholarships from the RDA and, according to their individual interests, were assigned a few key sessions to monitor. My interest groups were IG Domain Repositories and IG Education.

IG Domain Repositories: Community-driven research data management: towards domain protocols for research data management

This was my first breakout session. It started at 11:30H and ended at 13:00H, with Peter Doorn (@danskaw) and Patrick Aerts chairing. Their meeting objective was to introduce domain protocols for RDM, and they had three take-home messages:

  • treat data management and software sustainability on an equal footing, at least policy-wise.
  • consider and treat data and software as value objects.
  • make stakeholders' positions explicit by defining their roles.

Stakeholders ranged from government funding organizations to societies and other executive organizations. They made an interesting point about why a one-size DMP does not fit every research requirement, and instead stressed a domain-oriented approach: a single template for each domain, if you must. This specialized solution would address different sub-disciplines, thereby making DMPs easier to adopt and more suitable for the various research communities.

They also stressed the need to organize communities for DMP standards, laws and regulations, minimum conditions, templates and support resources. For example, the Domain Data Protocols (DDP) can be openly published with specific protocols for biology, physics, chemistry, cognitive science, etc. I learned how the DDP would make researchers' lives easier, as they can refer to data protocols and raise quality standards. A single approved RDM template for their domain would also reduce the admin burden for researchers.

They had identified DDP communities for nine domains and proposed that this granular approach would make it easy for funders to refer to the templates and a framework of protocols would help increase the adoption via the top-down approach rather than the random RDMPs being generated. Thereafter, there was an interesting Q&A session and it was a learning experience in itself.

IG-ETRD Meeting: Joining efforts on delivering Research Data Handling training to wider community

The second breakout meeting was in the afternoon from 14:00 to 15:30H. I attended the "IG Education and Training on handling of research data" session, where Yuri Demchenko introduced us to the cooperating EU projects, with updates from spin-off groups. He stressed the importance of teaching skills with accreditation and certification for the RDM curriculum.

The EOSC pilot project had defined policy and governance and science demonstrators could prove it works. He elaborated why a postgrad education delivers the needed skill sets. However, mere training does not scale in some other ways that data scientists can demonstrate their work. For EOSC, the WP7 has a framework to inform strategy. The researchers can draw on common data sources and also send out data to collaborate with other researchers, making it interoperable.

Next, the FOSTER plus project on the implementation of open science in H2020 was introduced. This 2-year project with a €900,000 budget would help train and educate researchers. Its 11 partners are building on previous work and facilitating open science training in European research. They would develop intermediate and advanced level training for RDM and Open Research data. The Open Science toolkit also includes SWC (Software Carpentry, data mining) and I was glad I was a trained SWC instructor.

Research data must "GO FAIR", which stands for Global Open FAIR (Findable-Accessible-Interoperable-Reusable), and the need of the hour was for an Internet of FAIR data and services (IFDS). However, this task was too large for one project and needed collaboration between projects and initiatives. They showed the EDISON services for Core data expert (capacity building and skills management), and the slide showing how the training model works was highly interesting. When building a data science team, a data steward plays an important role. Bodies like IEEE and ACM will develop a curriculum for this.

When the topic of MOOC-based video lectures came up, it came to light that the dropout rate is very high. They had also collected data on geographic coverage to map learners. The Goettingen library offers meetups with a focus on OS topics: research data, publishing, and other activities. They ran outreach programs via grad-school mailing lists, with a monthly hack hour and discussions, and helped people to work with Jupyter/IPython notebooks. They had an ODD (Open Data Day) celebration on 1st March 2017 and planned to hold more meetings.

The CODATA WG summer school on Research Data Science held "DataTrieste 2016" based on the SWC curriculum: Linux, bash, data analysis in R, etc. The goal was to train the trainers. Thereafter came the 2-minute presentations, where Freya presented the Denver breakout group sessions for text and data mining. The University of Cambridge had the Data Champion Communities, in which 2 people supported the training of researchers and had created a community that championed data sharing. Their presentation was to be held on Thursday afternoon. There was also a BoF on RDM literacy whose goal was to create better conditions for including RDM courses.

The Q&A discussion also touched upon the fact that FAIR was a western concept and that we needed to look at data from a developing country perspective, including cross-domain issues. Sustainability of knowledge, and disseminating and sharing knowledge bottom-up, is very important. The entire session was mentally stimulating, and during the Q&A I made some pertinent observations, viz.:

  • In MOOC the content is not CC licensed. Personally speaking, if a learning resource content is not openly licensed for reuse, then we will have to spend more resources in reinventing the wheel, resulting in knowledge silos.
  • Having spoken to professors who were not keen on the additional work burden of creating content for videos, I also mentioned the downsides of MOOCs: female professors feared becoming the subject of prank videos on YouTube. Both these issues kept them away from MOOCs.
  • My third point was regarding the geographic coverage statistics: these can get fuzzy when a proxy (e.g. Tor) is used to access the MOOC. User privacy is important.
  • With respect to the CODATA WG summer school on Research Data Science (DataTrieste), I was curious to understand if the summer school content was openly shared with other universities, just like the SWC program founded by Dr. Greg Wilson.

Conquaire data management plan (DMP)

The Conquaire data management plan (DMP) outlines the data handling of the Conquaire project and is continuously updated.

Conquaire goes the Open Science way to Berlin!

Day-1 : 2017-March-20

On a cold Monday morning, Christian and Vid were two of the many people found trooping into an impressive building[1] in Berlin to attend the Barcamp, a one-day event generously hosted by Wikimedia at their office.

The camp was kick-started with a self-introduction - each attendee had one line with 3-5 hashtags for describing themselves.

Guido handed the stage to Annekatrin Bock, whose Pecha Kucha talk on "Open Practices in the Classroom" touched upon Open Educational Resources (OER), the need and the motivation to share knowledge freely and, most importantly, the urgent need to improve the retention mechanism for researchers, including the reward and measuring system that will motivate them to stay on.

https://etherpad.wikimedia.org/p/oscibar2017_session7

A total of 22 sessions were proposed. Christian proposed a session on (add link here), while Vid proposed a session on "Reproducibility of Research Data & its Management", whose summary is here. Later, Konrad from Open Science Radio interviewed Christian [add link].

The session "Tools for open science" produced an interesting list of technical tools and suggestions that help researchers practise open science.

The barcamp session ended with a discussion on "FOSS in Open Science" by the smart legal eagles from the FSF-EU branch who spoke about harnessing Libre software for Open Science, the opportunities and the challenges researchers face while doing this.

The hashtag for the barcamp was #oscibar, and the Twitter account @lfvscience20 was live (re)tweeting attendees. An ether metapad was used by attendees to store the session proposals.

[1] ...and Berlin's landscape is dotted with many such impressive structures.

Day-2 : 2017-March-21

URL: https://www.open-science-conference.eu/programme/

On Tuesday, Prof. Klaus Tochtermann, from the ZBW – Leibniz Information Centre for Economics, opened the conference with an update on the science ministers' meeting on the open science policy platform to manage and govern science in the EU.

Then, Jean-Claude Burgelman spoke about the European Open Science Cloud (EOSC), which is supported by the German ministry and, by 2020, will federate all data infrastructures in an open and seamless (no data locking) cloud initiative. He introduced the OpenScience Monitor, an aggregator for all the science news in Europe.

Prof. Johannes Vogel spoke about the Open Science Platform and the EU Open Science policy that aims to foster open science. Professor Arndt Bode, from the Leibniz Supercomputing Centre, Munich, stressed the importance of Open Science and why it needs federated infrastructures. He introduced GeRDI, an infrastructure from the Leibniz Supercomputing Centre to support open and interdisciplinary research.

We heard Professor Jana Diesner, University of Illinois Urbana-Champaign (UIUC), speak about innovating compliantly and transparently, overcoming road blocks with openness and solutions for doing OpenScience.

After a good lunch, the afternoon session was dedicated to the posters selected for presentation at OSC. It was kicked off with lightning talks in which the authors of the top 10 posters presented their research projects to the attendees. Vid introduced Conquaire and it was well received by the attendees. Konrad interviewed Vid about research reproducibility [add link].

Day-3 : 2017-March-22

Wednesday, the final day of the conference, was opened by Professor Dirk van Damme, Head of the Innovation and Measuring Progress Division (IMEP), OECD, France, who introduced attendees to Open Educational Resources as a catalyst for innovation in education. He raised an important point about MOOCs not being openly licensed, which means they do not fulfill the five R's of OER, namely:

  • Retain
  • Reuse
  • Revise
  • Remix
  • Redistribute

It was quite evident that content sharing would be enabled by more open IP licensing, like CC licenses.

Professor Thomas Köhler from the Institute for Vocational Education & Media Center, Technical University Dresden, spoke about open educational practices as drivers of educational innovation, and how reuse and remix was crucial for the education sector.

Thereafter, there was a short presentation on "Report of the EC Expert Group on Metrics" followed by a panel discussion comprising:

  • Professor Judit Bar-Ilan, Department Information Science, Bar-Ilan University (Israel);
  • Professor Isabella Peters, ZBW – Leibniz Information Centre for Economics;
  • Dr. René von Schomberg, who represented DG Research & Innovation, European Commission;
  • Mr. Benedikt Fecher, German Institute for Economic Research (DIW Berlin) and Alexander von Humboldt Institute for Internet and Society; and
  • Professor Stefan Hornbostel, Institute for Research Information and Quality Assurance (iFQ), Germany.

The panel discussed the challenges: citation gaming and its acceptance by the research community, bibliometrics, and the biases in altmetrics and peer review. There was a lively discussion with audience questions. The panel agreed that it is important to measure what matters, and that Open Science requires Open Metrics, where data, content and code are not owned by companies that won't allow us to scrutinize the data.

Then, Lorna Campbell, from the University of Edinburgh, had an interesting presentation on crossing the field boundaries in Open Science, Open Data and Open Education.

Finally, Alexia Meyermann, from the German Institute for International Educational Research (DIPF), presented a report on building a research data infrastructure for educational studies in Germany. She described how only 144 out of 300 research projects gave a response about their research data.

The closing words were from Professor Klaus Tochtermann on the hopeful note of more progress and openness in the scientific world when the OSC was held the next year.

Conquaire presented at Repscience-2016 Workshop in Hannover.

On Friday, 9th September 2016, Cord Wiljes, Jochen Schirrwagen and Vid Ayer represented Bielefeld University at the Repscience 2016 Workshop, which was co-located with TPDL 2016. The venue was the Congress Centrum near the city zoo in the historical city of Hannover. Surrounded by two small rivers, the city was entirely rebuilt after the war, yet it has managed to seamlessly meld the old and the new: modern art occupies pride of place along the green, tree-lined central city roads, with an erstwhile palace forming the Leibniz University building.

Jochen chaired the first session and the last (third) session of the workshop in the afternoon, and Vid presented the goals and architecture of the Conquaire project at Universität Bielefeld. The workshop day was kick-started with an interesting keynote by Prof. Carole Goble of Manchester University, where she elaborated on research objects and the 'R' in 'replicable', 'reproduce', 'rerun', 'repeat', 'repair', 'reconstruct', etc. vis-a-vis the research lab environment from a technical and theoretical perspective.

Jingbo introduced the attendees to the Provenance Capture System at the Australian NCI's HPC centre. This cloud-based solution hosts 10 PB of research data and uses the PROV Ontology to create a traceable, reproducible, and machine-actionable workflow in order to store and publish the information as RDF documents in an RDF graph database using PID services.

The Scholix talk touched upon the interoperability issues in linking datasets and the lack of a cogent literature exchange framework, which they attempt to solve by creating a clear set of guidelines for RO, citations, data formats, protocols, etc. within the OpenAccess framework.

The o2r presentation touched upon the publication process, where a researcher can submit a folder and the system creates an ERC (Executable Research Compendium) for the existing workspace and also checks the metadata. The entire project uses Node.js extensively, coupled with Docker containers that package the R workspace to create base images for the ERC. Similarly, the ReplicationWiki project also uses Docker and VMs for its research workflow automation.

Dr. Dallmeier-Tiessen, an invited speaker, shared her experiences and lessons learned while enabling reproducible research within their research group at CERN which echoed the various issues that most researchers faced. Stefan Pröll from SBA spoke about their Query store used to store queries, parameters, metadata for the data citations and data sets in their small and large scale research setting.

To tackle the research infrastructure challenges of repeating scientific workflows, the FLARE project takes a workflow approach based on optimal database management that includes language operators, execution operators and workflow controls for different types of data sources. They plan to extend an existing language such as R to achieve this flexibility in data manipulation.

The workshop ended with a lively brainstorming session where all attendees and speakers exchanged ideas ranging from 'Failure' or negative data that is never published to the lack of guidelines for 'Replication' and 'Reproducibility' in research.

Conquaire Project officially started

Today, Monday, February 1st 2016, the Conquaire project officially started.

Interview about our Open Science Q&A portal on Wikimedia Germany blog

Perhaps you noticed the link to Open Science Q&A in this site's navigation and wondered how it is related to this project.

The main goal of Conquaire is to facilitate reproducible research. In order to do this, we will not just produce technology, but we will create this new infrastructure hand in hand with the scholars and researchers who are going to use it.

As reproducible research is becoming central not only to open science but to good scientific practice in general, lots of questions come up because everybody has to learn new things, and some of these have to be discussed first. Instead of creating a closed, local wiki or web forum to document how reproducible research works, we seized the opportunity to host an open, global question and answer website on open science when StackExchange decided it was not interested.

We would like to stress that while the Conquaire project provides a server to host Open Science Q&A, the community there is independent, and Conquaire has no influence on the content.

Readers of German can enjoy the long version of this story in an interview Wikimedia Germany did with our project member Christian Pietsch.

Short project description online

We figured that the project proposal might be a bit lengthy for first-time visitors and finally created an About page.