Algorithms, systems, and theories for exploiting data dependencies in crowdsourcing

Award Information

This website is based upon work supported by the National Science Foundation under Grant No. IIS-2007941, collaborative with NSF IIS-2008155. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Project Summary

Abundant data encode knowledge in many domains, such as biomedical research, online commerce, open government, education, and public health. Machine learning is a powerful tool for discovering novel knowledge from data and for helping individuals and organizations make informed decisions. However, machine learning needs to be bootstrapped with human-annotated knowledge, which can be expensive to obtain and may contain human errors. The research team discovers and exploits dependencies in the data through novel methodologies that significantly reduce the cost and noise of providing critical knowledge for machine learning. The research outputs, including algorithms, systems, and theories, are sufficiently generic to benefit many domains where machine learning is applicable. Through this fundamental research, the team will train undergraduate and graduate students for the nation's STEM workforce.

The researchers will collaborate to develop algorithms, systems, and theories for reducing cost and noise when annotating dependent data, termed “structured annotations”, to provide supervision for machine learning. While dependencies can make data annotation costly and error-prone, the researchers view them as a useful inductive bias for selective and accurate annotation. In particular, they propose a human-in-the-loop system to aid the construction of probabilistic graphical models that encode the dependencies. They combine contextual and multi-armed bandits with scalable graph inference algorithms to reduce labeling costs. Building on these graphical bandits, the team addresses budget allocation when the label of the same data point is queried repeatedly for robustness. For noisy human annotations, the team formulates optimization problems and algorithms that jointly infer annotator competence and the ground-truth labels of the data. From a theoretical perspective, the project will advance active learning in crowdsourcing settings under more realistic noise distributions and will analyze regret in structured annotation. The project will result in datasets, algorithms, and a testbed system that benefit not only the core machine learning research community but also the many domains that use machine learning.
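
To make the idea of jointly inferring annotator competence and ground-truth labels concrete, here is a minimal Python sketch. It is not the project's method: it is a generic iterative weighted-voting scheme that ignores dependencies between data points. The function name `aggregate_labels`, the smoothing constants, and the toy annotations are all assumptions chosen for illustration.

```python
from collections import defaultdict


def aggregate_labels(annotations, n_iters=10):
    """Alternate between estimating labels and annotator weights.

    annotations: list of (annotator, item, label) triples (hypothetical format).
    Returns (truths, weights): estimated label per item, weight per annotator.
    """
    annotators = {a for a, _, _ in annotations}
    weights = {a: 1.0 for a in annotators}  # start by trusting everyone equally
    truths = {}
    for _ in range(n_iters):
        # Step 1: weighted vote per item using the current annotator weights.
        votes = defaultdict(lambda: defaultdict(float))
        for a, item, label in annotations:
            votes[item][label] += weights[a]
        # Sorting the candidate labels makes tie-breaking deterministic.
        truths = {
            item: max(sorted(scores), key=scores.get)
            for item, scores in votes.items()
        }
        # Step 2: re-estimate each annotator's weight as the smoothed
        # fraction of their labels that agree with the current estimates.
        agree = defaultdict(int)
        total = defaultdict(int)
        for a, item, label in annotations:
            total[a] += 1
            agree[a] += int(truths[item] == label)
        weights = {a: (agree[a] + 1) / (total[a] + 2) for a in annotators}
    return truths, weights


# Toy data (made up): ann1 and ann2 agree on every item, ann3 disagrees.
toy_annotations = [
    ("ann1", 1, "A"), ("ann1", 2, "B"), ("ann1", 3, "A"), ("ann1", 4, "B"),
    ("ann2", 1, "A"), ("ann2", 2, "B"), ("ann2", 3, "A"), ("ann2", 4, "B"),
    ("ann3", 1, "B"), ("ann3", 2, "A"), ("ann3", 3, "B"), ("ann3", 4, "A"),
]
truths, weights = aggregate_labels(toy_annotations)
```

In this sketch an annotator who often disagrees with the weighted consensus receives a lower weight in the next round, so their votes count for less. Structured annotation as described above would additionally need to model the dependencies among labels, which this example deliberately leaves out.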



[EMNLP20] Nasim Sabetpour, Adithya Kulkarni and Qi Li. OptSLA: an Optimization-Based Approach for Sequential Label Aggregation. EMNLP'20 Findings, 2020. [code]

[WebConf21] Minghong Fang, Minghao Sun, Qi Li, Neil Zhenqiang Gong, Jin Tian and Jia Liu. Data Poisoning Attacks and Defenses to Crowdsourcing Systems. Proc. of the Web Conference 2021.

[ICDM21] Nasim Sabetpour, Adithya Kulkarni, Sihong Xie and Qi Li. Truth Discovery in Sequence Labels from Crowds. Proc. of 2021 IEEE Int. Conf. on Data Mining (ICDM'21), to appear, 2021.