DSPG 2021 - Data Reuse Project

Fostering Data Reuse: Measuring the Usability of Publicly Accessible Research Data

 

Project Sponsor: Martin Halbert, Senior Advisor for Public Access, National Science Foundation (NSF)

Project Collaborator: Sarah Nusser, Professor, Department of Statistics, ISU

DSPG Team Leader: Adisak Sukul, Associate Teaching Professor, Department of Computer Science

DSPG Graduate Fellow: Tiancheng Zhou, Graduate Student, Department of Computer Science

DSPG Interns: Sonyta Ung, Iowa State University, Computer Science (B.S.); Jack Studier, Iowa State University, Community and Regional Planning (B.S.); Saul Varshavsky, Drake University, Computer Science (B.S.), Data Analytics (B.S.), Artificial Intelligence (B.S.)

Additional information: This is a collaboration with a University of Virginia DSPG team.  Their faculty lead is Alyssa Mikytuck, with Gizem Korkmaz as collaborator [link].

 

Project description / goals:

How can researchers tell whether their data is being reused?  Are there metrics that capture the qualities that make data more reusable?  In essence, what should be considered when investigating what makes data highly reusable?

In this project, our mission is to study the factors that help us better understand what makes data highly reusable.  We explore various metrics from repositories that may affect data reusability.  The figure below gives an overview of our specific goals for this project:

 

Figure 1: Specific goals for our project (this image was created by Aditi Mahabal, who is on the University of Virginia DSPG team)

 

What Data Is Being Used for This Project, and Why Is It Relevant to Scientific Research: 

This project specifically dealt with publicly accessible research data, that is, data that is available to the public mainly through repositories.  Such data is one of the best tools available for propelling the scientific research process, because it allows researchers to be transparent about where the information supporting their work comes from.  Our client for this project was the National Science Foundation, and our mission was to give them insight into which factors lead to higher reusability of publicly accessible research data.

Methodology / Our Approach:  

Using the Data Science Framework as a guide, the research focuses on:

  1. Discover, inventory, profile, and document a set of over 200 repositories and what information they provide about the reuse of a data source.  Develop a data source with associated metadata such as number of citations, size of repository, number of views, number of downloads, and reusability scores.
  2. Evaluate the reuse information derived from repositories, including what that information signals about reuse of a shared data source and the quality of the repository's information on reuse.
  3. Critique the usefulness of the various reuse metrics, recommend which metrics are useful and for what purpose, and suggest metrics that should be developed where appropriate.
  4. Use Python (the BeautifulSoup, Selenium, and re packages) for HTML scraping, and Python graphical plots and Google Data Studio for data visualization, then apply statistical analysis to those plots and visualizations (a brief scraping sketch follows this list).
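
For illustration, here is a minimal sketch of the kind of HTML scraping this step involved.  The URL, page layout, and field labels are hypothetical placeholders, since each repository required its own scraping logic; Selenium stood in for requests on pages rendered with JavaScript.

    import re
    import requests
    from bs4 import BeautifulSoup

    # Hypothetical dataset page; every repository needs its own URL and selectors.
    url = "https://example-repository.org/dataset/123"
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Suppose the page shows "Views: 1,024" and "Downloads: 87" in a stats panel.
    stats_text = soup.get_text(" ", strip=True)
    views = re.search(r"Views:\s*([\d,]+)", stats_text)
    downloads = re.search(r"Downloads:\s*([\d,]+)", stats_text)

    record = {
        "views": int(views.group(1).replace(",", "")) if views else None,
        "downloads": int(downloads.group(1).replace(",", "")) if downloads else None,
    }
    print(record)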

Standard Definitions of Metrics Scraped From Each Repository:

The standard metrics we encountered that helped us analyze what makes data highly reusable are views, data downloads, citations, and reusability scores.  Views can be defined as the number of times a given dataset (from a repository) was viewed by other researchers.  Data downloads refer to the number of times a given dataset was downloaded for use.  Citations are the number of times a given dataset was cited as evidence by other researchers.  Finally, reusability scores indicate the extent to which a given dataset is considered highly reusable.
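
These four metrics formed a common schema across the repositories we scraped.  Here is a minimal sketch of how one such record might be represented; the field names are our own, not any repository's API.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReuseMetrics:
        """Standard reuse metrics for one dataset (illustrative schema)."""
        repository: str
        dataset_id: str
        views: Optional[int] = None                # times the dataset page was viewed
        downloads: Optional[int] = None            # times the dataset was downloaded
        citations: Optional[int] = None            # times the dataset was cited
        reusability_score: Optional[float] = None  # repository-specific score, if any

    example = ReuseMetrics("Kaggle", "example-dataset",
                           views=1024, downloads=87, citations=3,
                           reusability_score=0.88)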

How Do These Standard Metrics Help Us Better Understand the Process of Making Data Reusable?

We have observed that these standard metrics show how the process of data reuse is best represented as a cycle.  In order for data to be reused, a researcher must first search for relevant data and then view it.  If the researcher is interested in exploring the data further, they download it.  Finally, if the downloaded data supports the researcher's claims and statements, the researcher cites the data source to ensure transparency and credibility.  It's important to note that this is a simplified picture of the reuse process, and many repositories expose other metadata beyond these standard metrics.  Below, you'll find all of the repositories that were scraped by the Iowa State University team (the repositories scraped by the University of Virginia team are listed at link).

 

Results from each repository: 

  1. MorphoBank  (by Tiancheng Zhou)

MorphoBank is a web application for collaborative evolutionary research on the homology of phenotypes over the web, as well as a database of peer-reviewed morphological matrices.  This database is supported by the National Science Foundation (NSF), the American Museum of Natural History, and Phoenix Bioinformatics.

There are 932 publicly accessible projects in MorphoBank as of July 29, 2021.  Publicly available projects contain 142,902 images and 594 matrices.  MorphoBank also has an additional 1,488 projects in progress, which include another 183,240 photos and 1,273 matrices; these will become available as scientists complete their research and release their data.  Each project reports the number of views and the number of media downloads, where the media are the project's associated pictures.

Each project also reports the number of matrix downloads.  A matrix is a table of taxa that shows, for each taxon, its uncensored and censored cells as well as some other biological features; the matrix therefore shows the differences between the taxa in a project.  After extracting these metrics, I created a correlation table and a plot, and I found no strong correlation between any of the metrics.

The highest correlation is between project views and matrix downloads, at about 0.34, which might tell us that people usually download the matrix when they open and view a project in order to reuse its data.  Still, the more confident takeaway is that the producers of a dataset might benefit from tracking the numbers of media and matrix downloads as a measure of impact, since these repository-provided metrics reflect how people actually reuse the data.
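
As a minimal sketch, the correlation table and plot in Figures 2 and 3 can be produced with pandas and matplotlib; the CSV file and column names below are placeholders standing in for the scraped MorphoBank metrics.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder file: one row per MorphoBank project with its scraped metrics.
    df = pd.read_csv("morphobank_metrics.csv")  # columns: views, media_downloads, matrix_downloads

    # Pairwise Pearson correlations between all metrics.
    print(df.corr())

    # Scatter plot for the most correlated pair (about 0.34 in our data).
    df.plot.scatter(x="views", y="matrix_downloads")
    plt.title("Project views vs. matrix downloads")
    plt.show()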

Figure 2: Correlation Table

Figure 3: Correlation plot between project views and matrix downloads 

  2. Global Biodiversity Information Facility  (by Tiancheng Zhou)

The Global Biodiversity Information Facility (GBIF) is an international network and data infrastructure, funded by the world's governments, that aims to provide anyone, anywhere, open access to data about all types of life on Earth.  This repository contains 61,292 publicly accessible datasets, which are used by 6,048 peer-reviewed papers.  I extracted metrics from about 3,300 datasets: the number of occurrences (events in the world that got recorded), the number of downloads, and the number of citations for each dataset.

After scraping these metrics, I again created a correlation table and plot.  This time, they show a strong correlation between the number of downloads and the number of citations, at about 0.82.  Hence, we can be fairly confident that there is a solid relationship between downloads and citations, which is also an indicator of impact for the producer's data source, showing how people work with the datasets to reuse them.  Since this repository holds a large number of publicly accessible, reusable datasets, citations and downloads are reasonable candidates for factors that affect reusability if we investigate further.  Lastly, since occurrences correlate only weakly with the other metrics, we do not see evidence that this metric has any impact on data reusability.
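
GBIF also exposes a public registry API, which offers one way to enumerate datasets before visiting their pages; here is a minimal sketch under that assumption (the per-dataset download and citation figures themselves still came from the dataset pages).

    import requests

    # Page through GBIF's public dataset registry (https://api.gbif.org/v1/dataset).
    dataset_keys = []
    offset, limit = 0, 100
    while offset < 300:  # small illustrative sample; raise the bound for more
        resp = requests.get("https://api.gbif.org/v1/dataset",
                            params={"offset": offset, "limit": limit}, timeout=30)
        page = resp.json()
        dataset_keys.extend(d["key"] for d in page["results"])
        if page.get("endOfRecords"):
            break
        offset += limit

    print(f"collected {len(dataset_keys)} dataset keys")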

Figure 4: Correlation Table

Figure 5: Correlation plot between number of downloads and citations 

  3. Astrophysics Data System  (by Saul Varshavsky)

The Astrophysics Data System Repository (ADSR) has a total of 639 datasets, each of which was individually scraped.  Specifically, I scraped the number of citations, reads, downloads, and references from every dataset.  Citations represent how many times a given dataset was cited as evidence by another researcher; reads represent how many times a given dataset was read (explored); downloads represent the number of times a given dataset was downloaded for use; and references represent how many citations were referred to when collecting data for a given dataset.

As I scraped the ADSR, I encountered two major successes.  The first success was being able to consistently scrape the same types of metrics from each dataset, which in this case were the numbers of citations, reads, downloads, and references.  The second success was discovering a strong correlation between the number of times a given dataset was read (explored) and the number of times it was downloaded.

We can also see that citations and reads are more strongly correlated than citations and downloads.  This makes sense: the more a dataset is cited by other researchers, the more that dataset must have been read, and a dataset must be read before it can be cited or downloaded, so citations are influenced more directly by reads than by downloads.  Likewise, since a dataset must be read before it can be downloaded, reads and downloads directly influence one another, which may explain why reads and downloads have the strongest correlation of any pair of variables.  Refer to the visuals below for reference:
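
A minimal sketch of the downloads-vs.-reads regression behind Figure 6; the CSV file and column names are placeholders for the scraped ADSR metrics.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder file: one row per ADSR dataset with its scraped metrics.
    df = pd.read_csv("adsr_metrics.csv")  # columns: citations, reads, downloads, references

    # Simple least-squares fit: downloads ~ slope * reads + intercept.
    slope, intercept = np.polyfit(df["reads"], df["downloads"], deg=1)

    plt.scatter(df["reads"], df["downloads"], alpha=0.5)
    xs = np.linspace(df["reads"].min(), df["reads"].max(), 100)
    plt.plot(xs, slope * xs + intercept, color="red")
    plt.xlabel("reads")
    plt.ylabel("downloads")
    plt.show()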

Figure 6: Linear regression scatter plot on downloads vs. reads

Figure 7: Correlation matrix showing the correlations between citations, reads, downloads, and references

  4. NeuroMorpho  (by Sonyta Ung)

NeuroMorpho is a centrally curated inventory of digitally reconstructed neurons.  It contains contributions from hundreds of laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared.  The goal of NeuroMorpho is to provide free access to all available neuronal reconstruction data in the neuroscience community.  Throughout our research, the repository showed several appealing features: easy access, clear API instructions, detailed documentation of the repository's structure, up-to-date information, and a convenient process for contributing data.  The NeuroMorpho.org repository has about 5,000 uploaded publications and 377,000 neurons, and 1,681 publications have data available, covering 176,847 neurons.

Figure 8: Whisker plot on the number of citations by year

This plot shows the variety of citation counts of each publication by year.  Most of the publications have been cited anywhere from once to hundreds of times, and some outliers cause the median citation counts to vary from year to year.

Figure 9: Bar plot on average of citations by year

The average number of citations is computed as the total number of citations divided by the total number of publications in a given year.  Looking closely at the averages from 2005 to 2020, publications that have been available for a longer period of time have higher average citation counts than those published in recent years.  This brings up an important question: is the number of citations correlated with the number of publications?
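
The per-year average in Figure 9 is simply total citations divided by the number of publications for that year; here is a minimal sketch, with the CSV file and column names as placeholders for the scraped NeuroMorpho records.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder file: one row per publication with its year and citation count.
    df = pd.read_csv("neuromorpho_publications.csv")  # columns: year, citations

    # Average citations per publication, by publication year (Figure 9).
    df.groupby("year")["citations"].mean().plot.bar()
    plt.ylabel("average citations per publication")
    plt.show()

    # The whisker plot in Figure 8 comes from the same grouping.
    df.boxplot(column="citations", by="year")
    plt.show()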

  5. Kaggle (by Jack Studier)

Kaggle is an online community of data scientists and practitioners of machine learning.  Kaggle allows users to publish and explore datasets and their reusability metrics, such as size, view count, vote count, update date, download count, and Kaggle's proprietary usability score.  I explored the datasets by collecting 6,900 records through the API and using web scraping tools to retrieve additional metadata.
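
Kaggle's official Python client can list dataset metadata; here is a minimal sketch of the collection step under that assumption.  It requires a configured Kaggle API token, and the exact attribute names on the returned objects may vary, hence the defensive getattr calls.

    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

    records = []
    for page in range(1, 6):  # small illustrative sample; we collected ~6,900 records
        for ds in api.dataset_list(page=page):
            records.append({
                "ref": ds.ref,
                "size": getattr(ds, "totalBytes", None),
                "downloads": getattr(ds, "downloadCount", None),
                "votes": getattr(ds, "voteCount", None),
                "usability": getattr(ds, "usabilityRating", None),
            })

    print(len(records), "datasets collected")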

Figure 10: Graphs of download count and usability rating

Our research produced some interesting findings, many of which revolved around Kaggle's usability rating.  According to Kaggle's website, “It’s a single number we calculate for each dataset that rates how easy-to-use a dataset is based on a number of factors, including level of documentation, availability of related public content like kernels as references, file types and coverage of key metadata” (Goldbloom and Hamner, 2010).  As the figure above shows, there is a strong correlation between the usability score and the download count of a given dataset, which could indicate that users seek out data with a higher usability rating.

Figure 11: Usability Rating and Year of Last Update 

We also found that the median usability rating of datasets has increased over the last four years.  The Usability Rating was introduced in May 2019, and datasets created before then were retrofitted with the Usability Rating calculation.  Since then, the median usability rating of datasets has gradually climbed.  This could indicate that the existence of such a metric encourages data producers to complete the metadata requirements in pursuit of a higher Usability Rating, propelling a culture of reusability on Kaggle.

Observations: 

Figure 12: Data sharing, reuse process, and metrics relationship

Conclusion and Future Work:

For this project, we worked on obtaining metadata from several repositories, where the metadata contains the metrics that can impact the reusability of each repository's datasets.  We first tried the API-request approach, but later realized that HTML scraping was a better option for obtaining the metadata.  After scraping each repository, we created a number of plots for analysis, such as correlation plots, whisker plots, and bar plots.  Based on these data visualizations, we made observations about how each metric affects data reusability.  In the future, the plan is to collect more samples from data repositories, including metrics we have not yet analyzed, which would allow us to further observe which factors make data highly reusable.  An important question to consider for future projects is how data reusability scores are correlated with data downloads, citations, and views.

Links to Our Work:

  • Data Studio: Here you can find additional data visualizations related to our work
  • Presentation: Here you can find the presentation that we created with the University of Virginia DSPG team and delivered to the National Science Foundation and at the DSPG 2021 symposium
  • UVA website
  • GitHub

Citations: