skweak: A Python Toolkit For Applying Weak Supervision To NLP Tasks

Labelled data remains a scarce resource in many practical NLP scenarios. The only available option is often to collect and annotate texts by hand, which is expensive and time-consuming.

skweak is built on a very simple idea: instead of annotating text manually, we define a set of labelling functions that label our documents automatically, and then aggregate their results to obtain a labelled version of the corpus.

Weak supervision with skweak goes through the following steps:

Start: Prepare the (unlabelled) corpus to which the labelling functions will be applied. skweak is built on top of spaCy, so you need to convert your documents into spaCy Doc objects.

Step 1: Define a range of labelling functions that will take those documents and annotate spans with labels. Those labelling functions can take a variety of forms, from handcrafted heuristics to machine learning models, gazetteers, etc.

Step 2: Aggregate their results to obtain a single, probabilistic annotation (instead of the multiple, possibly conflicting annotations from the labelling functions). skweak does this with a generative model that automatically estimates the relative accuracy and possible confusions of each labelling function.

Step 3: Based on these aggregated labels, train your final NLP model.
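
To make Steps 1 and 2 more concrete, here is a minimal sketch in plain Python. It does not use skweak or spaCy: documents are simple token lists, every function name is hypothetical, and the aggregation is a per-token vote share, which is only a crude stand-in for the generative model that skweak actually fits.

```python
import re
from collections import Counter

CURRENCY = {"$", "€", "£"}

# Step 1: hypothetical labelling functions.
# Each one yields (start, end, label) token spans; end is exclusive.
def money_detector(toks):
    # A number directly preceded by a currency symbol.
    for i in range(1, len(toks)):
        if toks[i][0].isdigit() and toks[i - 1] in CURRENCY:
            yield i - 1, i + 1, "MONEY"

def year_detector(toks):
    # Four-digit tokens that look like a year.
    for i, tok in enumerate(toks):
        if re.fullmatch(r"(19|20)\d{2}", tok):
            yield i, i + 1, "DATE"

def preposition_date_detector(toks):
    # A number that follows "in" or "since".
    for i in range(1, len(toks)):
        if toks[i].isdigit() and toks[i - 1].lower() in {"in", "since"}:
            yield i, i + 1, "DATE"

def bare_number_detector(toks):
    # Deliberately noisy: any number without a currency symbol before it.
    for i, tok in enumerate(toks):
        if tok.isdigit() and (i == 0 or toks[i - 1] not in CURRENCY):
            yield i, i + 1, "CARDINAL"

LABELLING_FUNCTIONS = [
    money_detector,
    year_detector,
    preposition_date_detector,
    bare_number_detector,
]

# Step 2 (simplified): turn the conflicting votes into one distribution
# per token. skweak uses a generative model here, not vote counting.
def aggregate(toks, functions):
    votes = [Counter() for _ in toks]
    for fn in functions:
        for start, end, label in fn(toks):
            for i in range(start, end):
                votes[i][label] += 1
    return [
        {label: n / sum(c.values()) for label, n in c.items()}
        for c in votes
    ]

def best_labels(distributions):
    # "O" marks tokens that no labelling function annotated.
    return [max(d, key=d.get) if d else "O" for d in distributions]

tokens = "Donald Trump paid $ 750 in federal income taxes in 2016".split()
distributions = aggregate(tokens, LABELLING_FUNCTIONS)
labels = best_labels(distributions)

for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```

In this toy corpus the last token receives two DATE votes and one CARDINAL vote, so the aggregated annotation favours DATE while still keeping the disagreement visible as a probability. The aggregated labels would then serve as training data for the final model in Step 3.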

The skweak toolkit provides a Python API to apply labelling functions and aggregate their results with just a few lines of code. It can be applied to both sequence labelling and text classification, and it adds features such as support for underspecified labels and document-level labelling functions.
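
One way to picture an underspecified label is as a vote for a set of possible labels rather than a single one. The sketch below is an assumption made for illustration, not skweak's implementation: the label names, the mapping, and the rule of splitting a vote evenly across compatible labels are all hypothetical.

```python
# Concrete labels the final model should predict (hypothetical).
LABELS = ["PERSON", "ORG", "LOC"]

# Hypothetical underspecified labels and the concrete labels they allow.
UNDERSPECIFIED = {
    "ENT": {"PERSON", "ORG", "LOC"},   # "some entity, type unknown"
    "NOT-LOC": {"PERSON", "ORG"},      # "an entity, but not a location"
}

def expand(label):
    """Return the set of concrete labels compatible with a vote."""
    return UNDERSPECIFIED.get(label, {label})

def combine(votes):
    """Split each vote evenly over its compatible labels, then normalise."""
    scores = dict.fromkeys(LABELS, 0.0)
    for vote in votes:
        compatible = expand(vote)
        for label in compatible:
            scores[label] = scores.get(label, 0.0) + 1.0 / len(compatible)
    total = sum(scores.values())
    if total == 0:
        return scores  # no votes at all: nothing to normalise
    return {label: score / total for label, score in scores.items()}

# Three labelling functions vote on the same span.
probs = combine(["ENT", "NOT-LOC", "PERSON"])
best = max(probs, key=probs.get)
print(best, probs)
```

Even the vague votes carry information here: "ENT" and "NOT-LOC" both remain compatible with PERSON, so they reinforce the one specific vote instead of being discarded.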

Paper: https://aclanthology.org/2021.acl-demo.40.pdf

GitHub: https://github.com/NorskRegnesentral/skweak

Source: https://www.facebook.com/dsociety.uit.ise/posts/pfbid02KNgojUvSQ2P9g4fSw...