Open-Set Learning with Augmented Category by Exploiting Unlabelled Data (Open-LACU)

The rapid growth of real-world datasets, whether in computer vision, audio processing, or medical diagnosis, demands robust and flexible classifiers. One key challenge is handling vast amounts of unlabelled data along with the ever-present possibility of encountering entirely new categories.

Dr Emile Engelbrecht, under the supervision of Professor Johan du Preez, tackled this problem in his dissertation, Open-set learning with augmented category by exploiting unlabelled data (Open-LACU). This article summarises that research, outlining its background and objectives, and how Open-LACU aims to transform future classification systems.

Background: Classification Meets Real-World Complexity

Machine learning – and in particular, neural network–based classification – has seen tremendous success. From labelling individual pixels in images to identifying diseases based on medical scans, classification systems are used to make sense of data by assigning it to relevant categories. However, many real-world datasets do not neatly fit into a small set of predefined classes:

1. High Annotation Cost

Labelling data can be prohibitively expensive. In simple cases (like labelling cat or dog images), minimal expertise is required, but for applications such as semantic segmentation of complex images or diagnosis of medical conditions, only trained experts can provide reliable labels. These experts often need to sift through hours of video or audio or meticulously label every pixel in thousands of images, driving costs up dramatically.

2. Presence of Novel Categories

Traditional supervised classification systems have no mechanism to say, “I don’t know what this is”. If a classifier trained only to distinguish cats and dogs is suddenly given an image of an owl, it will still classify it as either a cat or a dog. In applications with safety or health implications – such as autonomous vehicles or medical diagnostics – incorrect classifications can have dire consequences.

3. Continual Accumulation of Unlabelled Data

Real-world applications generate new data constantly. Some of this new data may belong to known categories (e.g., additional pictures of cats and dogs), while some may represent entirely new and unknown patterns (like images of owls). Handling unlabelled data effectively while identifying new, unseen categories is a core challenge that modern classifiers must address.

Semi-Supervised Learning and Novelty Detection

Two important subfields of classification research provide partial solutions to these challenges:

1. Semi-Supervised Learning

Semi-supervised learning (SSL) aims to reduce the annotation burden by leveraging large amounts of unlabelled data. Instead of requiring labels for every single sample, SSL uses a small subset of labelled data to guide learning and then extrapolates patterns from unlabelled samples. This helps train classifiers without incurring the massive costs associated with fully supervised learning.
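To make the SSL idea concrete, the sketch below shows a generic pseudo-labelling baseline (one common SSL strategy, not the specific method from the dissertation): a simple nearest-centroid classifier is fitted on the small labelled set, then confidently predicted unlabelled samples are added to the labelled pool and the classifier is refitted. All function names and the margin threshold are illustrative assumptions.

```python
import numpy as np

def nearest_centroid_fit(X, y, k):
    """Compute one centroid per class from labelled data."""
    return np.stack([X[y == c].mean(axis=0) for c in range(k)])

def predict_with_confidence(centroids, X):
    """Return predicted classes and a confidence margin
    (distance gap between the two nearest centroids)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    pred = d.argmin(axis=1)
    sorted_d = np.sort(d, axis=1)
    margin = sorted_d[:, 1] - sorted_d[:, 0]  # larger margin = more confident
    return pred, margin

def pseudo_label_round(X_lab, y_lab, X_unlab, k, margin_thresh=0.5):
    """One round of pseudo-labelling: add confidently predicted
    unlabelled points to the labelled pool and refit the centroids."""
    centroids = nearest_centroid_fit(X_lab, y_lab, k)
    pred, margin = predict_with_confidence(centroids, X_unlab)
    keep = margin > margin_thresh
    X_aug = np.vstack([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, pred[keep]])
    return nearest_centroid_fit(X_aug, y_aug, k)
```

In practice modern SSL uses neural networks with consistency regularisation rather than centroids, but the principle is the same: a few labels guide the model, and unlabelled data sharpens the decision boundaries.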

2. Novelty Detection

Novelty detection (ND) focuses on identifying samples that do not belong to any of the known, labelled categories. If a classifier is only trained on a specific set of categories (e.g., cats and dogs), ND helps it recognise when an input image is neither a cat nor a dog. Such a capability is vital where misclassification could cause harm, such as misdiagnosing a rare disease.
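A minimal sketch of the "reject" mechanism that ND builds on is softmax-confidence thresholding, one of the simplest ND baselines (and not the dissertation's actual method): if the classifier's top class probability is too low, the sample is flagged as novel instead of being forced into a known class.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classify_with_reject(logits, threshold=0.9):
    """Return the predicted class index, or -1 ('novel') when the
    top softmax probability falls below the confidence threshold."""
    probs = softmax(logits)
    top = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    return np.where(top >= threshold, pred, -1)
```

A cat/dog classifier shown an owl would ideally produce a flat, low-confidence output and land in the rejected (-1) bucket rather than being called a cat or a dog.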

Although SSL and ND both deal with unlabelled data and unknown categories, they have traditionally been studied somewhat in isolation. SSL typically assumes the unlabelled data primarily comes from known classes, while ND focuses on spotting completely unfamiliar inputs at test time. In reality, unlabelled data can simultaneously include known and novel categories, so bridging these two subfields is crucial.

Observed-Novel vs. Unobserved-Novel Categories

One of the main sources of confusion in existing research is the term “novel category”. Dr Engelbrecht’s dissertation clarifies that novelty occurs on two distinct levels:

1. Observed-Novel Category

This refers to patterns that appear in the training data but are never labelled, and are hence "unknown" from the model's perspective. Because they occur during training – even if unlabelled – the model can potentially learn some representation of them, if guided appropriately.

2. Unobserved-Novel Category

This type of novelty is completely absent during training and appears only once the model is deployed. A classic example would be a newly discovered medical condition that did not exist (or was not labelled) at the time the model was trained.

Both categories matter because they have different implications for practical classification. Observed-novel data could, in theory, be separated out by the model if it realises there is a cluster of unlabelled examples that looks “different” from the labelled ones. In contrast, unobserved-novel data demands a mechanism to say “this is truly new” when deployed in the field. Dr Engelbrecht argues that conflating these two – treating any unlabelled or new sample as if it is unobserved – obscures critical differences in how models learn to handle them.

Introducing Open-LACU

To address the challenges of SSL and ND in a unified manner, Dr Engelbrecht developed Open-LACU. The conceptual vision behind Open-LACU is threefold:

1. Semi-Supervised Classification of Known Categories

The model primarily needs to classify samples from K labelled (source) categories, while effectively leveraging abundant unlabelled data relevant to these K categories.

2. Detection of an Observed-Novel Category (K+1)

Some portion of the unlabelled data may not belong to any of the K source categories. Since these patterns still appear during training (albeit unlabelled), the model aims to form a separate “observed-novel” category to handle them.

3. Detection of an Unobserved-Novel Category (K+2)

The biggest leap: enabling the model to say, “I have never seen this category before” during deployment. This unobserved-novel recognition is critical for safety, adaptability, and practical real-world use.

By adopting this three-pronged approach, Open-LACU avoids the pitfalls of either purely semi-supervised or purely novelty-detection-centric models. In other words, it can handle partially labelled datasets rife with novelty while still being vigilant about newly emerging categories once the model is in use.
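The resulting label space can be pictured with a toy decision rule (a hypothetical illustration, not the dissertation's actual architecture): the network outputs scores for the K source categories plus the observed-novel category learned from unlabelled training data, and anything it is unsure about at deployment time is routed to the unobserved-novel category. The value of K and the rejection threshold below are arbitrary assumptions.

```python
import numpy as np

# Hypothetical label-space layout for an Open-LACU classifier with K
# source categories: indices 0..K-1 are the labelled classes, index K
# is the observed-novel category (K+1 in the article's numbering), and
# index K+1 is the unobserved-novel category (K+2).
K = 5
OBSERVED_NOVEL = K
UNOBSERVED_NOVEL = K + 1

def open_lacu_decision(logits, reject_threshold=0.5):
    """Toy decision rule: the network produces K+1 logits (K source
    classes plus the observed-novel class); any input whose top
    probability is below the threshold is assigned to the
    unobserved-novel category."""
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    if probs.max() < reject_threshold:
        return UNOBSERVED_NOVEL
    return int(probs.argmax())
```

The key point the sketch captures is that the observed-novel class has its own learned output (it was seen during training), whereas the unobserved-novel class can only be reached by rejection, since no training signal for it exists.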

Objectives and Significance of the Research

Dr Engelbrecht’s dissertation outlines three primary objectives that form the backbone of Open-LACU:

The first goal is to build a neural network for LACU that can learn an augmented category from unlabelled data (i.e., detect the observed-novel category). This strongly emphasises distinguishing samples belonging to the K labelled categories from unlabelled samples that do not align with any of these categories.

The second objective was to link SSL and open-set recognition (OSR). OSR usually entails a “reject” option, where the model can decide that a sample does not belong to any known class and label it as unknown. Generative OSR classifiers often use fake samples to regularise this “reject” boundary, while generative SSL models similarly leverage unlabelled data to improve learning of the K source categories. Unifying SSL and OSR under one framework allows the model to comprehensively manage known and newly emerging classes.

Combining the insights from LACU and SSL-OSR, the final aim was to demonstrate a prototype capable of distinguishing between labelled categories, observed-novel categories in unlabelled training data, and unobserved-novel categories appearing only during deployment. This prototype provides proof of concept and pinpoints the areas where further research is needed.

Fig 1: General neural network classifier trained to distinguish cats from dogs. Note that the same neural network is used for both cases.

Methodology and Experimental Insights

To test the core ideas behind Open-LACU, Dr Engelbrecht used small-scale image datasets such as MNIST, SVHN, and CIFAR. These datasets allowed the researchers to:

  • Simulate Different Categorical Scenarios

By designating certain digits (e.g., 0–4) or object categories as labelled and leaving others unlabelled, the experiments captured how well a model can detect unlabelled (observed-novel) patterns while still classifying the known categories.

  • Evaluate Unobserved-Novel Detection

Data corresponding to completely unseen categories was introduced only at test time. The effectiveness of Open-LACU hinges on correctly tagging these truly unseen categories as novel rather than forcing them into one of the learned classes.

  • Identify Future Challenges

The research revealed that while the Open-LACU framework is viable, there are complexities in training a single model to juggle semi-supervised classification, observed-novelty detection, and unobserved-novelty detection simultaneously. For instance, the presence of numerous unlabelled categories can create ambiguity, requiring careful algorithmic design to manage boundary decisions for each category type.
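The kind of dataset partition described above – some classes labelled, some present but unlabelled, some withheld entirely – can be sketched as follows. The function name and parameters are illustrative assumptions, not the dissertation's exact experimental protocol.

```python
import numpy as np

def make_open_lacu_split(X, y, source_classes, unobserved_classes,
                         labelled_per_class=10, seed=0):
    """Partition a dataset into Open-LACU roles:
      - a small labelled set drawn from the source classes,
      - an unlabelled set holding the remaining source-class samples
        plus the observed-novel classes (present but never labelled),
      - a held-out set of unobserved-novel classes seen only at test time."""
    rng = np.random.default_rng(seed)
    labelled_idx, unlabelled_idx, unobserved_idx = [], [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        if c in unobserved_classes:
            unobserved_idx.extend(idx)          # withheld until deployment
        elif c in source_classes:
            labelled_idx.extend(idx[:labelled_per_class])
            unlabelled_idx.extend(idx[labelled_per_class:])
        else:
            unlabelled_idx.extend(idx)          # observed-novel: unlabelled only
    return (np.array(labelled_idx), np.array(unlabelled_idx),
            np.array(unobserved_idx))
```

For MNIST, for example, digits 0–4 could act as source classes, 5–7 as observed-novel classes mixed into the unlabelled pool, and 8–9 as unobserved-novel classes introduced only at evaluation.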

Looking Forward

Open-LACU is a transformative step toward building classifiers that are more reflective of real-world conditions. By identifying different types of novelty and leveraging unlabelled data effectively, classifiers can be more cost-efficient (through SSL) and safer (through ND). While the prototype implementations on small-scale datasets confirm the feasibility of the approach, significant work remains to make Open-LACU truly application-grade. Future research may explore:

  • Scaling to Large Datasets: Most real-world applications will require models capable of processing millions of samples across diverse categories.
  • Refining Training Procedures: Balancing the objectives of classification, observed-novel detection, and unobserved-novel detection is non-trivial. More sophisticated algorithms may integrate active learning or domain adaptation techniques.
  • Expanding to Different Data Modalities: While most current studies focus on image datasets, audio, text, and time-series data offer further complexity (and opportunities) for Open-LACU approaches.
  • Integrating with Continual Learning: Real-world data changes over time. Ensuring that a model can update itself or adapt to new categories as they appear in ongoing data streams is another vital area of research.

Ultimately, the Open-LACU framework holds promise in advancing classifiers that can learn with minimal labelling requirements while safely and accurately handling new, unexpected data. As Dr Engelbrecht’s dissertation highlights, unifying SSL and ND not only addresses immediate classification challenges but also lays the groundwork for truly adaptable and future-proof systems.

Download and read the complete research: https://scholar.sun.ac.za/items/49db76f2-698c-467f-8458-5caa12b38ea4