Share and Flourish - Propositions

See also the meeting agenda and session details.

This page is currently being updated (August 13 2014)

This list of propositions is intended to kick-start discussions.
It identifies topics that the workshop organizers have deemed important, but it does not reflect our opinion.
Green and red propositions belong together, green being the point of view that promotes data sharing, and red being the one that puts a brake.
It is up to the workshop consensus to decide which propositions have merit and which do not.

Propositions for "Datasharing utopia"

Intro: Recent large-scale projects aim to collect and make available heterogeneous datasets (Human Brain Project (HBP), Allen Institute for Brain Science (AIBS)). No individual scientist or discipline has all the necessary expertise to collect all the relevant information or to make sense of these data. There is thus a need for an infrastructure that allows interdisciplinary teams to analyze and interpret multi-scale data and build quantitative models. Such a collection of data lends itself to data mining and to the discovery of new correlations between data at different length and time scales, which is presently virtually impossible based on data in the primary literature. Within the HBP, these data are to be used for predictive neuroinformatics, completing the severely incomplete data in neuroscience.

The open access of large, heterogeneous datasets is essential for areas in science dedicated to multi-scale topics, such as neuroscience. When such data are not made accessible we miss opportunities to solve important societal problems.
Large databases lead to the risk that science is reduced to massive data mining, without needing the interpretation and guidance of scientists, cutting them out of the loop.
The cost of populating and maintaining large databases is borne by a different group of scientists than those who benefit from them most, undermining the model.

Propositions for "Data producers vs. data consumers"

Data producers should have the chance to get the first scoop from their data. A reasonable time before sharing data is four years.
In the same way that data producers are asked to share data, data consumers should be requested to release code as open source.
Funding agencies should revamp the way that research output is measured, and value the number of times that a shared dataset has contributed to a publication.
Data producers should not control what analysis is done on their data once it has been made available.
Data producers remain responsible for checking the correct use of their data.

Propositions for "Define data sharing"

To define data sharing, it is key to establish who is sharing what data, when, and with whom. In this session we discuss the various categories that exist for each keyword.

Who owns the data?

The consortium that runs the project
The institute where the data is measured or that ordered the data to be measured elsewhere
The PI who runs the project
Pieces of a dataset can be owned by different entities

By default, the owner of a data set is the institute that employs the PI who is responsible for acquiring the data. Specific terms in grants and contracts may overrule this.
The data owner decides whether the data gets shared and with whom.
As a consequence, if a PI moves between institutes, he needs permission from his previous employer to access 'his' data.
Data sharing starts with a data archive ('Research Data Management'), from which select data sets can be shared.

What data gets stored in a data archive?

All data that has contributed to a publication
All data that has been generated by a project with a well-defined protocol (e.g. a population study)
Everything that the successor of a researcher needs to carry on with the project

To proceed with data sharing, subject privacy is a huge issue.
What levels of privacy do we distinguish?

AA No privacy concerns, no humans involved
A Fully anonymized data
B Anonymized data, but the patient may be identified if the 'intruder' has his DNA, biometric face data, or extended demographic data that is relatively easy to find (age, gender, residence, education, etc.)
C Privacy-sensitive data, with obvious subject identifiers (name, social security number) removed
D Privacy-sensitive data

The following propositions may be discussed in later sessions, but they depend on the definitions that we agree on in this session.

Research Data Management systems may contain data up to privacy category C.
Restricted access Data Management systems may contain data up to privacy category C.
Public access data may contain data up to privacy category B.
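The privacy categories and the three storage propositions above can be written down as a small lookup, which may help when discussing where a given data set is allowed to live. This is only a sketch: the system names used as dictionary keys and the function `may_store` are hypothetical, and the ceilings simply mirror the propositions as stated, which are open for discussion and not agreed policy.

```python
from enum import IntEnum


class PrivacyLevel(IntEnum):
    """Privacy categories from the list above, ordered from least to most sensitive."""
    AA = 0  # no privacy concerns, no humans involved
    A = 1   # fully anonymized data
    B = 2   # anonymized, but re-identifiable via DNA, face biometrics or demographics
    C = 3   # privacy sensitive, obvious identifiers (name, social security number) removed
    D = 4   # privacy sensitive data


# Hypothetical keys for the three kinds of system named in the propositions.
# The ceilings are the proposed ones, not an agreed rule.
MAX_LEVEL = {
    "research_data_management": PrivacyLevel.C,
    "restricted_access": PrivacyLevel.C,
    "public_access": PrivacyLevel.B,
}


def may_store(system: str, level: PrivacyLevel) -> bool:
    """Return True if data of the given privacy level is allowed in the system."""
    return level <= MAX_LEVEL[system]
```

For example, `may_store("public_access", PrivacyLevel.C)` evaluates to `False` under these propositions, while `may_store("restricted_access", PrivacyLevel.C)` evaluates to `True`.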

Propositions for "Why insist on data sharing"

Why should funding agencies want scientists to share data? What is their long-term vision and what are their motives? What are the downsides of policies aimed at public data sharing: does it change data quality? What does a re-allocation of the means of funding agencies mean? What does it mean that foreign countries can benefit from data/infrastructure to a greater extent than national researchers?

Each funding agency gives a pitch of five minutes presenting:

Their long-term vision on data sharing
What is the role / responsibility of their organization in achieving this utopia?
Will the organization pro-actively take this role / responsibility (e.g. data management plan as part of proposal, making data sharing obligatory, allocating funds for data sharing and re-using data, etc.)?
What are the downsides of data sharing? E.g. quality of annotation, costs, division of means within and over countries, etc.

Propositions for "Multi-center data analysis"

Propositions for "Credits for sharing"

While there is no doubt about their usefulness, the production of annotated datasets is not science in itself. Data publications should have the status of technical reports.
Experimental design, building the required experimental setup and designing a data model are an integral part of the research ecosystem, and can be considered original research contributions.
Scientific results are rated by publications and impact factors. So, the only credit that will motivate a data producer to share is to be a coauthor on each derived publication.
The current ecosystem of methods papers, analysis papers, review papers etc. can simply be extended with data papers. The regular mechanism of citation suffices to give credit to those who produced the data. Only an active contribution to the writing of a paper should be awarded with coauthorship.
An often-used data publication should have the same impact as a Nature paper.
Sharing data should be an obligation for data that is acquired with public money.
Data producers should get rewarded for the release of data by giving them citations and, where appropriate, coauthorship.
Authorship credit should be associated with work applied to address the scientific question and not only a result of shared data.
A system should be made available to receive credits for shared data, i.e., how often a data set has been used and the results published (s-index).
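The last proposition does not define the s-index precisely. One possible reading, assumed here purely for illustration, is the number of distinct publications that used a given data set. The function name `usage_counts` and the input layout (pairs of a publication identifier and the data sets it used) are hypothetical.

```python
from collections import defaultdict


def usage_counts(publications):
    """Count, per data set, the number of distinct publications that used it.

    `publications` is an iterable of (publication_id, dataset_ids) pairs.
    A publication listed more than once is counted only once per data set.
    """
    used_by = defaultdict(set)
    for pub_id, dataset_ids in publications:
        for dataset_id in dataset_ids:
            used_by[dataset_id].add(pub_id)
    return {dataset_id: len(pubs) for dataset_id, pubs in used_by.items()}
```

Counting distinct publications rather than raw mentions keeps a data set from being credited twice for the same paper; whether reuse should be weighted in some other way is exactly the kind of question the proposition leaves open.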

Propositions for "Data integrity"

It is the responsibility of the sharer to assure that the data that is shared is well-documented with regard to the data collection protocols and potential problems with the data.
It is the responsibility of those who use shared data to read any information describing data collection and to perform their own data quality assessment.
1) the quality of data needs to be ascertained before release:
a) Yes, quality of data needs to be assessed, quantified, and noted. Use of data of lesser quality should be avoided; only data of a certain quality should be released
b) No, new algorithms may be developed that can deal with certain data issues. Poor data should thus also enter the database
2) the data provider is responsible for the proper use of released data:
a) Yes, users should be checked for proper use of annotations (e.g. paradigms/markers) and proper application of analysis techniques/algorithms
b) No, the user should have complete freedom to analyze the data. Metadata and their description (e.g. fMRI task specifics) should be adequate for correct interpretation of the data
3) the database manager is responsible for ensuring high quality of metadata.
a) Yes, data in the database should have a minimal level of completeness. Metadata should be checked for entry errors.
b) No, all data are worth entering in the database. With large numbers some loss of metadata can be afforded.
4) should data entry to the database be limited based on quality?
a) only prime data (quality of images + metadata) are accepted for release
b) all data are accepted but quality parameters are added
5) who is responsible for determining data quality?
a) it is the responsibility of data providers (e.g. the database manager provides procedures and guidelines for doing so – role INCF)
b) it is the responsibility of the database manager
6) who is responsible for the correct use of data and metadata in analyses and papers?
a) the data provider verifies correct use by reviewing the draft paper (and becomes co-author)
b) the author is responsible for correctly using the data (e.g. using guidelines, standard procedures – role INCF)

Propositions for "Datasharing and Intellectual Property"

Government agencies are increasingly funding public-private partnerships (PPP) in which companies and university scientists work together on a problem relevant to industry, and for which there is an in-cash contribution from industry. The formulation of such relevant research topics for grant submission already requires protected intellectual property (IP) from the industrial partner. When such a project is funded, long negotiations regarding IP ensue between the funding agency, university and industry partners. As the data collected in such a project represent a competitive advantage for the involved company, they will not want to share it with competitors, thereby preventing public access. A model in which the data are made available after a certain black-out time might be an option for this.

Data collected in PPPs should be exempt from data sharing requirements, but there should be profit sharing in successful projects that supports open science.
The science funding agencies have no role in funding projects that benefit specific companies at the expense of public access to scientific data.

Propositions for "Large scale, long term storage"

Public access to research data requires infrastructure and continued investment to maintain this infrastructure. Furthermore, when patient data is stored, confidentiality issues need to be properly dealt with and adapted to technologies that improve over time. It also requires fundamental research into how to efficiently store and provide access to an expanding database and into tools for analyzing/mining such data.
At the moment there are three parties providing such infrastructure: publishers, through the supplementary data sections and data publications; universities, where the data is stored on servers in the groups where the research is conducted or sometimes more centrally; and dedicated organizations/consortia funded by fixed-term government grants (e.g. ESFRI projects). There is a price tag associated with each mode that gets passed on to the user in different ways. When research groups leave (or the PI retires), when journals are discontinued by publishers, or when the funding stops, access to these data is lost.

Governments fund national archives to store (paper) documents, but nowadays paper is more and more replaced by electronic data. Providing and maintaining infrastructure is therefore a task of government, preferably in an international collaboration.
In case of a breach of security of a data-sharing repository, the data owner can be held responsible for revealing privacy-sensitive data.
In case of a breach of security of a data-sharing repository, the data sharing facility can be held responsible for revealing privacy-sensitive data.

Propositions for "Automated analysis pipelines"

Propositions for "Ethics of sharing patient data"

The human subject decides whether and to what degree his data may be shared. But can he also demand that his data gets shared?
A public sharing option should be included on all consent forms.
The researcher has to inform the subject about the implications of data sharing
This workshop should propose a standard consent form for all experiments that explicitly allows for datasharing.
World-wide public sharing of neuroimaging data is only possible with written informed consent from the subject
Anonymized and face-scrambled neuroimaging data can be shared publicly, given that such data no longer identifies the given subject.
Data with unexpected medical findings should be tagged as 'not a healthy control' but not be reported to the subject, given that such knowledge may severely impact quality of life
Reporting of unexpected medical findings may motivate people to participate in neuroimaging studies, and should be promoted as part of the informed consent
It should be possible to withdraw consent. All publicly shared data should be traceable to the original subject to enable data removal
It should be possible to withdraw consent, but this will only affect data that has either not been anonymized or not yet been publicly shared.
Future data mining techniques will make it next to impossible to protect the identity of a subject in the presence of meta data like age, gender, race, place of residence etc. The subject should be informed that there is a serious risk that he will be identified if he consents to the public sharing of such data.
Neuroimaging data is only valuable if accompanied by an extensive set of subject-related meta data, such as age, gender, race, place of residence, smoking, diagnosis, etc. Public sharing of such data is fine, as long as care is taken that individual subjects cannot be identified from this data.
To protect the identity of both the subject and the experimenter, only instrument name and parameter settings are to be included with publicly shared data.
Cross-site variability in neuroimaging data is so large that it is absolutely necessary to retain when, where and by whom the data were acquired.

Propositions for "Public sharing versus restricted sharing"

Public sharing means that data is made accessible to everyone without further conditions restricting the type of analysis that can be performed. In restricted sharing, the data is shared in a consortium with the possibility of making the data available publicly after a particular black-out time. Under what conditions is restricted data sharing more appropriate than public sharing? Or do we want to move to full public data sharing? Public sharing implies that companies could access data as well, and use it to their competitive advantage. For instance, what if (insurance) companies access presumably anonymized data? Using sophisticated data mining techniques they might even be able to identify individual patients.

The product of publicly financed research is public knowledge. The data should be made publicly available directly after its collection. This includes raw data and metadata that is produced in support of this knowledge in accordance with standards. Examples: Allen Institute for Brain Science (http://www.alleninstitute.org/), Human Connectome Project (http://www.humanconnectomeproject.org/).
Scientists should be able to restrict access to their data (e.g. to within a consortium, or for a certain period of time), so that they have the opportunity to publish articles before the data is made available publicly. Large scientific consortia accumulate data over time, and are therefore supported through different grants and funding agencies. Furthermore, these consortia lose and gain scientific partners over time. This requires clear policies and guidelines to accommodate funder-specific regulations and IP regulations of universities, which can only be tackled through restricted data sharing.

Propositions for "Meta data standards"

Propositions for "Integration with non-imaging data"

Can we foresee multi-purpose databases developing in the future, or are the requirements of (DICOM) imaging databases and gene sequences too diverse?
Is there a (minimal) meta data standard that we can already implement to accommodate genomic data and behavioral or symptom data?
Extra precautions should be applied for the sharing of genetic data, which is more difficult to anonymize.
Data which cannot be anonymized (e.g., video content) should not be shared with an external site. However, anonymized derived data can be shared.
The sharer should assure that there is an accurate link (subject identifier) between the imaging and non-imaging data.

Propositions for "Large scale, long term storage"

Maintaining data security at the sharing site is the responsibility of the sharing site.
