Abstract
Relation extraction with distant supervision generates many false positives in the training data for natural language processing (NLP) models. Crowdsourcing is effective at gathering ground truth for training NLP systems, but it can also be expensive. Active learning optimizes crowdsourcing by selecting the examples that are most representative or most likely to need correction. In this talk, I will discuss ongoing work on predicting which distant supervision seeds are likely to be false positives and having them annotated by the crowd. Compared to annotating a random sub-sample, we expect our active learning method to provide higher-quality training data and, in turn, better performance of our relation extraction model.
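The abstract does not specify how likely false positives are identified, so the following is only a minimal sketch of one plausible selection step: score the distantly supervised seeds with a simple classifier and spend the crowd's annotation budget on the positive-labeled seeds the model is least confident about. The TF-IDF features, logistic-regression scorer, and function name are illustrative assumptions, not the authors' method.

```python
# Sketch: confidence-based selection of likely false-positive seeds
# for crowd annotation. All modeling choices here are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def select_seeds_for_annotation(sentences, distant_labels, budget):
    """Return indices of positive seeds the scorer is least confident about.

    sentences      -- sentence contexts of the distant-supervision seeds
    distant_labels -- noisy 0/1 labels produced by distant supervision
    budget         -- number of seeds the crowd can annotate
    """
    features = TfidfVectorizer().fit_transform(sentences)
    scorer = LogisticRegression(max_iter=1000).fit(features, distant_labels)

    # Estimated probability that each seed truly expresses the relation.
    p_positive = scorer.predict_proba(features)[:, 1]

    # Positive-labeled seeds with the lowest scores are the likeliest
    # false positives; those are sent to the crowd first.
    positives = [i for i, y in enumerate(distant_labels) if y == 1]
    positives.sort(key=lambda i: p_positive[i])
    return positives[:budget]
```

In practice one would score each seed with a model not trained on that seed (e.g. via cross-validation folds), since a scorer fit on the full noisy set can simply memorize its own labels; the sketch omits this for brevity.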