
4 ways to encode categorical features with high cardinality


In this article, we will go through 4 popular methods to encode categorical features with high cardinality: (1) Target encoding, (2) Count encoding, (3) Feature hashing and (4) Embedding.

We will explain how each method works, discuss its pros and cons and observe its impact on the performance of a classification task.


Introducing categorical features

Categorical features are a type of variable that describes categories or groups (e.g. gender, color, country), as opposed to numerical features that measure a quantity (e.g. age, height, temperature).

There are two types of categorical data: ordinal features, whose categories can be ranked and sorted (e.g. T-shirt sizes or restaurant ratings from 1 to 5 stars), and nominal features, whose categories don’t imply any meaningful order (e.g. the name of a person or of a city).
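As a quick illustration (a minimal pandas sketch with made-up values), an ordinal feature carries an order we can exploit, while a nominal one does not:

import pandas as pd

# Ordinal: T-shirt sizes have a meaningful order
sizes = pd.Categorical(
    ["S", "M", "L", "XL"],
    categories=["S", "M", "L", "XL"],
    ordered=True
)
print(sizes.min(), sizes.max())  # S XL

# Nominal: city names have no meaningful order to rank by
cities = pd.Categorical(["Paris", "Lyon", "Nice"])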

Why do we need to encode categorical features?

Encoding a categorical variable means finding a mapping that converts a category to a numerical value.

While some algorithms (like decision trees) can work with categorical data directly, most machine learning models were designed to operate on numerical data only and cannot handle categorical features as-is. Encoding categorical variables is therefore a necessary step.

Besides, some machine learning libraries require all data to be numerical. This is the case for scikit-learn, for example.

Why is one-hot encoding not suited to high cardinality?

A common approach to encoding categorical features is to apply one-hot encoding. This method encodes categorical variables by adding one binary variable for each unique category.

If a feature describing colors has three categories [red, blue, green], a one-hot encoder would transform it into three binary variables, one for each category.
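As a minimal sketch (using pandas; the column and values are only illustrative), one-hot encoding the color example looks like this:

import pandas as pd

colors = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
# Adds one binary column per category: color_blue, color_green, color_red
print(pd.get_dummies(colors, columns=["color"]))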

If a categorical feature has hundreds or thousands of categories, applying one-hot encoding would add hundreds or thousands of binary variables to the feature vector. Models struggle with such large sparse data because of the curse of dimensionality: searching a solution space with more dimensions is harder, overfitting is easier, and both computational time and space complexity increase.

So how can we encode high-cardinality categorical features without increasing the dimensionality of the feature vectors?

Application on an AdTech dataset

We will answer that question by applying four encoding techniques on the dataset of Criteo’s Display Advertising Challenge to predict click-through rates on display ads.

It is a famous Kaggle challenge launched in 2014 by Criteo, a French online advertising company specializing in programmatic advertising and real-time bidding. The click-through rate (CTR) of an ad is the number of times it was clicked divided by the number of times it was displayed on a page.

Datasets in AdTech usually contain ID variables with high cardinality, such as site_id (ID of the website on which an ad is displayed), advertiser_id (ID of the brand behind the ad), os_id (ID of operating system of the user for whom the ad is displayed).

The Criteo dataset consists of 1 million rows and 39 anonymized columns: 13 numerical variables and 26 categorical variables. Their cardinalities are listed in the table below. We see that many features have very high cardinality (above 10k).

Cardinality of categorical features in the Criteo dataset

The dataset contains 241,338 categories overall. Applying one-hot encoding would mean transforming the feature space from 39 dimensions to 241,351 dimensions (13 numerical columns plus 241,338 binary columns). It is clear that performing computations on a sparse matrix of over 241k columns is very costly and inefficient.

Let us split the dataset into training and testing set and explore the encoding methods.

from sklearn.model_selection import train_test_split

# All columns except the first one, which is the label
features = df.columns[1:]
# Categorical columns are the ones named C1, C2, ... in the Criteo dataset
categorical_features = [feature for feature in features if feature[0] == "C"]
x_train, x_test, y_train, y_test = train_test_split(df[features], df["label"])

Overview of each encoding method

(1) Target encoding

We use the target encoder of the category_encoders library, which is defined as follows:

Features are replaced with a blend of the expected value of the target given particular categorical value and the expected value of the target over all the training data.
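To make this blend concrete, here is a minimal pandas sketch of smoothed target encoding for a single column (the smoothing weight is a hypothetical choice, not the exact formula used by category_encoders):

import pandas as pd

def smoothed_target_encode(train_col, target, weight=10):
    # Expected value of the target over all the training data
    prior = target.mean()
    # Per-category mean and sample count
    stats = target.groupby(train_col).agg(["mean", "count"])
    # Blend: categories with few samples are pulled towards the prior
    smoothed = (stats["count"] * stats["mean"] + weight * prior) / (stats["count"] + weight)
    return train_col.map(smoothed)

With this sketch, categories with few samples stay close to the global mean, which is exactly the blend described above.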

from category_encoders.target_encoder import TargetEncoder

enc = TargetEncoder(cols = categorical_features).fit(x_train, y_train)
X_train_encoded = enc.transform(x_train)
X_test_encoded = enc.transform(x_test)

Note that we only fit the encoder on the training dataset and then use the fitted encoder to transform both training and testing set. As we do not have access to y_test in real life, it would be cheating to use it to fit the encoder.

  • DIMENSION OF ENCODED FEATURES SPACE: 39 columns, X_train_encoded and X_test_encoded have the same shape as x_train and x_test.
  • PROS:
    – parameter free
    – no increase in feature space
  • CONS:
    – risk of target leakage (target leakage means using information from the target to predict the target itself)
    – when a category has few samples, the target encoder replaces it with a value very close to the target of those samples, which makes the model prone to overfitting the training set
    – does not accept new categories in the test set

(2) Count encoding

With count encoding, also called frequency encoding, categories are replaced by their frequency (or count) in the dataset. If the ID 3f4ec687 appears 957 times in the column C7, then we replace 3f4ec687 with 957.

If two categories appear the same amount of times in the dataset, such a method encodes them with the same value although they do not hold the same information. This creates what we call a collision: two distinct categories are encoded with the same value.
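A minimal pandas sketch of the same idea, fitted on the training set only (using column C7 from the example above):

counts = x_train["C7"].value_counts()       # how often each category appears in the training data
c7_encoded_train = x_train["C7"].map(counts)
c7_encoded_test = x_test["C7"].map(counts)  # categories unseen during training become NaN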

from category_encoders.count import CountEncoder

enc = CountEncoder(cols = categorical_features).fit(x_train, y_train)
X_train_encoded = enc.transform(x_train)
X_test_encoded = enc.transform(x_test)
  • DIMENSION OF ENCODED FEATURES SPACE: 39 columns, X_train_encoded and X_test_encoded have the same shape as x_train and x_test.
  • PROS:
    – easy to understand and implement
    – parameter free
    – no increase in feature space
  • CONS:
    – risk of information loss when collision happens
    – can be too simplistic (the only information we keep from the categorical features is their frequency)
    – does not accept new categories in the test set

(3) Feature hashing

Feature hashing projects categorical features into a feature vector of fixed dimension that is much smaller than the one-hot encoded feature space. The dimension of that feature vector needs to be defined beforehand. This is done with the hashing trick, which applies a hash function to the categories and uses their hash values as indices in the feature vector.

There are two ways to implement feature hashing:

  • either we apply hashing feature by feature (there is one hashing space per feature, so we need to choose one dimension parameter per feature)
  • or we hash all features together (there is one single hashing space for all features and only one parameter to choose, but collisions can happen between features).
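To make the hashing trick concrete, here is a minimal per-feature sketch using Python's standard hashlib (category_encoders uses its own hashing internally; this is only illustrative):

import hashlib

def hash_category(value, n_components):
    # Map the category string to a stable integer, then to an index in the hashed space
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

index = hash_category("3f4ec687", 520)  # 520 = 20 * 26 categorical features
# The hashed feature vector of size 520 gets a 1 at this index

Two distinct categories can land on the same index, which is the collision risk discussed below.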

This method is not parameter free: we need to choose the size of the hashing space. We follow the advice of the article Don’t get tricked by the hashing trick and set the hashing size to 20*k, where k is the number of categorical features (in our case, k = 26).

from category_encoders.hashing import HashingEncoder

enc = HashingEncoder(
    cols=categorical_features,
    n_components=20 * len(categorical_features)
).fit(x_train, y_train)
X_train_encoded = enc.transform(x_train)
X_test_encoded = enc.transform(x_test)
  • DIMENSION OF ENCODED FEATURES SPACE: 533 columns
  • PROS:
    – limited increase of feature space (as compared to one hot encoding)
    – does not grow in size and accepts new values during inference as it does not maintain a dictionary of observed categories
    – captures interactions between features when feature hashing is applied on all categorical features combined to create a single hash
  • CONS:
    – need to tune the dimension of the hashing space
    – risk of collision when the dimension of hashing space is not big enough

(4) Embedding

Embedding is a popular encoding technique from deep learning and natural language processing (NLP). It consists of building a trainable lookup table that maps each category to a fixed-length vector representation. During training, the weights in the table are updated to better describe the similarities between categories.

We will follow this Keras tutorial to “build an encoder model that codes the categorical features to embeddings, where the size of the embedding for a given categorical feature is the square root to the size of its vocabulary. We train these embeddings in a simple NN model through back-propagation.”

import math

import numpy as np
import tensorflow as tf

def build_input_layers(features):
    # One Keras input per feature: string inputs for categorical features, floats otherwise
    input_layers = {}
    for feature in features:
        if feature in categorical_features:
            input_layers[feature] = tf.keras.layers.Input(
                shape=(1,),
                name=feature,
                dtype=tf.string
            )
        else:
            input_layers[feature] = tf.keras.layers.Input(
                shape=(1,),
                name=feature,
                dtype=tf.float32
            )
    return input_layers

def build_embeddings(size=None):
    input_layers = build_input_layers(features)
    embedded_layers = []

    for feature in input_layers.keys():
        if feature in categorical_features:
            # Get the vocabulary of the categorical feature
            vocabulary = sorted(
                [str(value) for value in list(x_train[feature].unique())]
            )
            # Convert the string input values into integer indices
            cardinality = x_train[feature].nunique()
            pre_processing_layer = tf.keras.layers.StringLookup(
                vocabulary=vocabulary,
                num_oov_indices=cardinality,
                name=feature + "_preprocessed"
            )
            pre_processed_input = pre_processing_layer(input_layers[feature])
            # Create an embedding layer with the specified dimensions
            embedding_size = int(math.sqrt(cardinality))
            embedding_layer = tf.keras.layers.Embedding(
                input_dim=2 * cardinality + 1,
                output_dim=embedding_size,
                name=feature + "_embedded"
            )
            embedded_layers.append(embedding_layer(pre_processed_input))
        else:
            # Return numerical feature as it is
            embedded_layers.append(input_layers[feature])

    # Concatenate all the encoded features.
    encoded_features = tf.keras.layers.Concatenate()([
        tf.keras.layers.Flatten()(layer) for layer in embedded_layers
    ])

    # Apply dropout.
    encoded_features = tf.keras.layers.Dropout(rate=0.25)(encoded_features)

    # Perform a non-linear projection.
    encoded_features = tf.keras.layers.Dense(
        units=size if size else encoded_features.shape[-1], activation="gelu"
    )(encoded_features)
    return tf.keras.Model(inputs=input_layers, outputs=encoded_features)

def build_neural_network_model(embedding_encoder):
    input_layers = build_input_layers(features)
    embeddings = embedding_encoder(input_layers)
    output = tf.keras.layers.Dense(units=1, activation="sigmoid")(embeddings)

    model = tf.keras.Model(inputs=input_layers, outputs=output)

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.AUC()]
    )

    return model

embedding_encoder = build_embeddings(64)
neural_network_model = build_neural_network_model(embedding_encoder)

# Training
def build_dataset(x, y):
    dataset = {}
    for feat in features:
        if feat in categorical_features:
            dataset[feat] = np.array(x[feat]).reshape(-1, 1).astype(str)
        else:
            dataset[feat] = np.array(x[feat]).reshape(-1, 1).astype(float)

    return dataset, np.array(y).reshape(-1, 1)

x_train, y_train = build_dataset(x_train, y_train)
x_test, y_test = build_dataset(x_test, y_test)
history = neural_network_model.fit(x_train, y_train, batch_size=1024, epochs=5)

X_train_encoded = embedding_encoder.predict(x_train, batch_size=1024)
X_test_encoded = embedding_encoder.predict(x_test, batch_size=1024)
  • DIMENSION OF ENCODED FEATURES SPACE: 64 columns
  • PROS:
    – limited increase of feature space (as compared to one hot encoding)
    – accepts new values during inference
    – captures interactions between features and learns the similarities between categories
  • CONS:
    – need to tune the embedding size
    – the embeddings and the logistic regression model cannot be trained jointly in one phase, since logistic regression does not train with backpropagation. Instead, the embeddings have to be trained in a first phase and then used as static inputs to the logistic regression model.

Benchmarking the performance to predict CTR

We fit a simple logistic regression to predict the CTR and generate predictions with each encoding method.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=10000)
model.fit(X_train_encoded, y_train)
y_pred = model.predict(X_test_encoded)

We compute the log loss, AUC, recall and the average of the predicted CTR. The results are in the table below. The average CTR in the dataset is 22.6%.
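For reference, these metrics can be computed with scikit-learn as follows (a sketch; predict_proba is needed for the log loss, the AUC and the average predicted CTR):

from sklearn.metrics import log_loss, roc_auc_score, recall_score

y_proba = model.predict_proba(X_test_encoded)[:, 1]
print("log loss:", log_loss(y_test, y_proba))
print("AUC:", roc_auc_score(y_test, y_proba))
print("recall:", recall_score(y_test, y_pred))
print("average predicted CTR:", y_proba.mean())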

We first notice that the AUC is pretty low for all encoding methods, which is mostly due to the fact that we used a very simple model; a more complex model than logistic regression would be better suited for this task.

The first three methods (target, count and hashing encoding) did not allow the model to capture enough signal to predict the CTR: the average predicted CTR is very low compared to the true average, the recall is also very close to zero, and an AUC close to 0.5 tells us that the model is almost random. The model with embeddings shows the highest AUC and recall, and an average predicted CTR close to the true average.

Conclusion

We’ve explored four techniques to encode categorical data with high cardinality: target encoding, count encoding, feature hashing and embedding. Specifically, we learned:

  • The challenge of working with high-cardinality categorical data
  • The advantages and limitations of each of the four techniques
  • How to implement each technique as part of a classification task to predict Click-Through Rate

