Machine Learning in A/B Testing - Part 2
Table of Contents
Experiment and AB Testing Machine Learning Models
Launch an AB Test
In this article, we use an application of A/B testing in marketing.
1. Email AB Test
In marketing, A/B tests are often run via e-mail: two groups receive messages with different subject lines, and the click-through rate is measured to compare user engagement. In this case, launching the A/B test simply means sending the emails to both groups and analyzing the responses once the data has been collected.
Example service: Campaign Monitor helps businesses launch e-mail A/B tests.
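A minimal sketch of how the collected responses could be summarized, assuming a hypothetical responses.csv log with one row per recipient, a group column ("A" or "B"), and a binary clicked column:
import pandas as pd
# Hypothetical response log: one row per recipient,
# 'group' is the e-mail variant ('A' or 'B'), 'clicked' is 0 or 1
responses = pd.read_csv("responses.csv")
# Click-through rate per subject-line variant
print(responses.groupby("group")["clicked"].mean())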
2. Web Page, Web Application AB Test
In the web page A/B testing case, users in the two groups see two different versions of a page, and we can measure which version drives better user engagement.
…
Machine learning model A/B testing architecture. Source: ML in Production
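One common way to split web users into the two groups consistently is to hash the user ID, so a returning visitor always sees the same variant. A minimal sketch, assuming a string user ID and a 50/50 split (both are illustrative assumptions):
import hashlib
def assign_variant(user_id: str) -> str:
    # Deterministic bucketing: the same user always gets the same variant
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"
print(assign_variant("user-123"))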
3. About BERT
First of all, when you think about A/B testing, you may want to group your target users based on their web browsing content. A typical example of web browsing data is text, and with conventional natural language processing methods, keyword-based classification models are used. BERT, on the other hand, understands the context of the entire text, and thus is expected to be able to classify users' interests more precisely.
For example, suppose you have a text database of your target users' web browsing data, and you want to divide them into two groups: those interested in machine learning and those interested in software engineering. In that case, we need labeled text data, i.e. text browsed by users whose group we already know, and we can use it to fine-tune a pre-trained BERT model. Since we do not have such data here, this experiment shows how to fine-tune BERT using movie review data instead.
4. BERT Fine-tuning
One of the most common uses of BERT is to download a model that has been pre-trained on a large amount of text and fine-tune it with a small amount of task-specific data. In this article, we show how to download a pre-trained model from Hugging Face and fine-tune it with sample code.
- Install Required Packages
pip install datasets
pip install torch
pip install transformers
# When using Google Colab
!pip install datasets
!pip install torch
!pip install transformers
- Load and Check Movie Review Dataset
from datasets import load_dataset
raw_datasets = load_dataset("imdb")
print(raw_datasets)
- Select Samples for Train and Test
sample_train_val = raw_datasets['train'].shuffle().select(range(0,2000)).to_pandas()
sample_test = raw_datasets['test'].shuffle().select(range(0,500)).to_pandas()
- Import Libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score
from sklearn.metrics import precision_score, f1_score
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import EarlyStoppingCallback
import torch
import numpy as np
- Define Pretrained Tokenizer and Model
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
- Preprocess Dataset
# Define a simple class inherited from torch Dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])
sample_x = list(sample_train_val["text"])
sample_y = list(sample_train_val["label"])
X_train, X_val, Y_train, Y_val = train_test_split(sample_x, sample_y, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)
input_train = Dataset(X_train_tokenized, Y_train)
input_val = Dataset(X_val_tokenized, Y_val)
- Define Evaluation Metrics
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)
    print(classification_report(labels, pred))
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
- Mount Google Drive (In case of using Google Colab)
from google.colab import drive
drive.mount('/content/gdrive')
# 'bert-output' is an empty folder in your Drive
%cd /content/gdrive/'My Drive'/'bert-output'
- Fine-tune BERT
# Define Training Arguments
args = TrainingArguments(
    output_dir="models",
    evaluation_strategy="steps",
    eval_steps=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    seed=0,
    load_best_model_at_end=True,
)
# Define Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=input_train,
    eval_dataset=input_val,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
# Fine-tune pre-trained BERT
trainer.train()
- Load Fine-tuned BERT and Run Prediction
# Load test data
X_test = list(sample_test["text"])
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)
# Create torch dataset
test_dataset = Dataset(X_test_tokenized)
# Load trained model (the checkpoint number depends on your training run)
model_path = "models/checkpoint-100"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=2)
# Define test trainer
test_trainer = Trainer(model)
# Make prediction
raw_pred, _, _ = test_trainer.predict(test_dataset)
# Preprocess raw predictions
y_pred = np.argmax(raw_pred, axis=1)
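As a quick sanity check (not required for the A/B test itself), the predictions can be compared against the held-out IMDB labels using the metrics imported earlier:
# Compare predictions against the held-out labels in sample_test
y_true = list(sample_test["label"])
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))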
Analyze the Experiment and AB Test
With the data collected from user activity in the web page or email A/B test, we can compare the efficacy of the two designs A and B. Simply comparing mean values is not enough, as it fails to assess the statistical significance of our observations.
In order to do that, we will use a two-sample hypothesis test. Our null hypothesis H0 is that the two designs A and B have the same efficacy, i.e. that they produce an equivalent click-through rate, average revenue per user, etc. Statistical significance is then measured by the p-value, i.e. the probability, under H0, of observing a discrepancy between our samples at least as large as the one we actually observed.
Depending on the metric, we apply different statistical tests (see the sketch after this list).
- Discrete metrics
Example discrete metrics: click-through rate
Statistical tests used:
- Fisher’s exact test
- Pearson’s chi-squared test
- Continuous metrics
Example continuous metrics: average spending per user, average time spent on a web page
Statistical tests used:
- Z-test
- Student’s t-test
- Welch’s t-test
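A minimal sketch of how two of these tests could be run with SciPy; the contingency table and per-user spending arrays below are placeholder numbers for illustration only:
from scipy import stats
# Discrete metric: click-through rate
# Hypothetical 2x2 table: [clicks, no-clicks] for designs A and B
contingency = [[120, 880],
               [150, 850]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)
print("Pearson's chi-squared p-value:", p_chi2)
odds_ratio, p_fisher = stats.fisher_exact(contingency)
print("Fisher's exact test p-value:", p_fisher)
# Continuous metric: average spending per user
# Hypothetical per-user spending for the two groups
spend_a = [12.3, 0.0, 5.4, 20.1, 7.8, 3.2]
spend_b = [15.2, 3.1, 0.0, 22.4, 9.9, 6.5]
t_stat, p_welch = stats.ttest_ind(spend_a, spend_b, equal_var=False)  # Welch's t-test
print("Welch's t-test p-value:", p_welch)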
Conclusion
Machine Learning and Data Science models can be tricky; there are many things that can mislead you during model development.
- Never fully trust model performance based on offline testing alone
- Design your experiment and A/B test carefully, and never skip the minimum required sample size
- When classifying users according to their web browsing content (text), BERT can be used
- When training BERT, download the pre-trained model and fine-tune it
- Use statistical tests to compare model performance