### Obtaining the datasets from the Hugging Face Hub

In [None]:
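# Note: recent releases of `datasets` deprecate this helper in favour of `huggingface_hub.list_datasets`
from datasets import list_datasets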


existing_datasets = list_datasets()
print("The Hugging Face Hub contains {} datasets".format(len(existing_datasets)))

We are only interested in the "emotion" dataset. This dataset consists of English tweets, each labelled with one of six categories: _sadness_, _joy_, _love_, _anger_, _fear_, and _surprise_. We can load it with the `load_dataset()` function.

In [None]:
from datasets import load_dataset

emotion_dataset = load_dataset('emotion')

The dataset behaves much like a Python dictionary (i.e., we can use Python dictionary syntax) and provides access to three different splits of the data: `train`, `validation`, and `test`. As an example, we can check how many items each of these splits contains.

In [None]:
print('{} elements in the training set'.format(len(emotion_dataset['train'])))
print('{} elements in the validation set'.format(len(emotion_dataset['validation'])))
print('{} elements in the test set'.format(len(emotion_dataset['test'])))

In [None]:
print(emotion_dataset['train'][:3])

In the next step, we convert the Hugging Face dataset into a pandas `DataFrame`, mainly to take advantage of the rich pandas API and its visualization features.

In [None]:
import pandas as pd
emotion_dataset.set_format(type='pandas')
df = emotion_dataset['train'][:]
df.head()

As seen in the table, the label is an integer value between 0 and 5. For convenience, we create a function that translates this integer value into the appropriate text for that label.

In [None]:
def label_int2str(value: int):
    return emotion_dataset['train'].features['label'].int2str(value)


df['category'] = df['label'].apply(label_int2str)
df.head()
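
To make the mapping concrete, here is a minimal standalone sketch of what `int2str` does for this dataset. The label order below is an assumption; it can be checked in the notebook with `emotion_dataset['train'].features['label'].names`. The names `LABEL_NAMES`, `int_to_label`, and `label_to_int` are hypothetical and used only for illustration.

```python
# Assumed label order; verify against features['label'].names in the notebook.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def int_to_label(value: int) -> str:
    """Translate an integer class id into its text label."""
    if not 0 <= value < len(LABEL_NAMES):
        raise ValueError(f"label id {value} is out of range")
    return LABEL_NAMES[value]


def label_to_int(name: str) -> int:
    """Inverse mapping: text label back to the integer class id."""
    return LABEL_NAMES.index(name)


print([int_to_label(i) for i in (0, 1, 5)])
```

Keeping the inverse mapping around is handy later on, when a classifier returns integer predictions that need to be shown as text.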

The following graph shows that the data is heavily imbalanced; compare, for example, the number of tweets labelled as _joy_ with the number labelled as _surprise_.

In [None]:
import matplotlib.pyplot as plt

df['category'].value_counts(ascending=True).plot.barh()
plt.title('Frequency of tweets within each class')
plt.show()
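
Besides reading the imbalance off the bar chart, it can be expressed as numbers. The sketch below computes the share of each class and the ratio between the most and least frequent class. The labels are toy values chosen for illustration, not the real counts of the dataset, and the helper names are hypothetical.

```python
from collections import Counter

# Toy labels for illustration only; these are not the real dataset counts.
labels = ["joy"] * 6 + ["sadness"] * 5 + ["anger"] * 2 + ["surprise"] * 1


def class_shares(values):
    """Return {label: fraction of samples}, most frequent label first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


def imbalance_ratio(values):
    """Ratio between the most and the least frequent class."""
    counts = Counter(values).values()
    return max(counts) / min(counts)


shares = class_shares(labels)
print(shares)
print(imbalance_ratio(labels))
```

In the notebook, the same quantities could be obtained from `df['category'].value_counts(normalize=True)`.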

How long are the tweets? Is there any difference in the length of the tweets across categories?


In [None]:
df['Words per tweet'] = df['text'].str.split().apply(len)
df.boxplot('Words per tweet', by='category', grid=True, showfliers=False, color='black')
plt.suptitle('')
plt.xlabel('')
plt.show()
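
The box plot summarizes, per category, the distribution of whitespace-separated word counts. As a rough sketch of the underlying computation, the following groups word counts by category and takes the median of each group. The tweets are hypothetical examples, and `median_words_per_category` is an illustrative helper, not part of the notebook.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (text, category) pairs, only for illustration.
tweets = [
    ("i feel great today", "joy"),
    ("i am so happy", "joy"),
    ("i feel sad and alone right now", "sadness"),
    ("i miss you", "sadness"),
]


def median_words_per_category(rows):
    """Group whitespace-separated word counts by category and take the median."""
    lengths = defaultdict(list)
    for text, category in rows:
        lengths[category].append(len(text.split()))
    return {category: median(counts) for category, counts in lengths.items()}


result = median_words_per_category(tweets)
print(result)
```

Note that these are word counts, not token counts; the tokenizer used later may split a single word into several tokens.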

Once we no longer need the pandas API, the output format should be reset to the default Hugging Face Datasets format. This is done with the `reset_format()` method.

In [9]:
emotion_dataset.reset_format()

Next, we load the tokenizer that matches the model we will use later.

In [None]:
from transformers import AutoTokenizer

model_ckpt = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
print('The number of tokens in the vocabulary is {}'.format(tokenizer.vocab_size))

We need a function that takes a batch of sentences and returns their tokenized version. The function defined below is deliberately simple: it just applies the tokenizer created above to the `text` column, padding shorter sentences and truncating overly long ones.

In [12]:
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True)
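
To illustrate what `padding=True` and `truncation=True` do to a batch, here is a simplified sketch that works on lists of token ids. It is not the tokenizer's actual implementation: the token ids are hypothetical, the padding id of 0 is an assumption, and a real tokenizer keeps the closing special token when truncating, which this plain slice does not.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Pad every sequence to the longest one in the batch, capped at max_length."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; a real tokenizer would produce these from text.
encoded = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 102]],
    max_length=5,
)
print(encoded)
```

The attention mask is what lets the model tell real tokens apart from the padding added to make all sequences in the batch the same length.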

We now convert the whole dataset to the corresponding lists of token ids.

In [None]:
emotion_encoded = emotion_dataset.map(tokenize, batched=True, batch_size=None)
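
With `batched=True`, the mapped function receives a dictionary of columns rather than a single row, and the columns it returns are added to the dataset. Passing `batch_size=None` processes each split as one batch, so padding yields the same length for every example in the split. The following is a rough pure-Python sketch of that behaviour, not the library's implementation; `map_batched` and `count_words` are hypothetical helpers.

```python
def map_batched(columns, fn, batch_size=None):
    """Apply fn to slices of a column dict and merge the returned columns in."""
    n_rows = len(next(iter(columns.values())))
    step = n_rows if batch_size is None else batch_size
    new_columns = {}
    for start in range(0, n_rows, step):
        batch = {name: values[start:start + step] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Existing columns are kept; returned columns are added (or overwritten).
    return {**columns, **new_columns}


def count_words(batch):
    """Hypothetical batch function: one word count per text."""
    return {"n_words": [len(text.split()) for text in batch["text"]]}


data = {"text": ["i feel fine", "so tired", "what a day it was"], "label": [1, 0, 1]}
mapped = map_batched(data, count_words)
print(mapped)
```

For a function such as `count_words`, the batch size does not change the result; for `tokenize`, it does, because padding depends on the longest sentence in each batch.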

In [None]:
print(emotion_encoded['train'].column_names)

In [None]:
print(emotion_encoded['train']['input_ids'][0:3])

In [None]:
from transformers import AutoModel
import torch
model_ckpt = 'distilbert-base-uncased'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = AutoModel.from_pretrained(model_ckpt).to(device)

In [None]:
text = 'this is a test'
inputs = tokenizer(text, return_tensors='pt')
print(f"Input tensor shape: {inputs['input_ids'].size()}")

In [None]:
inputs['input_ids']

In [None]:
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
print(outputs)

In [None]:
outputs.last_hidden_state.size()

In [None]:
outputs.last_hidden_state[:,0].size()
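
The last hidden state has shape `[batch_size, n_tokens, hidden_dim]`, and indexing with `[:, 0]` keeps only the vector at position 0 of each sequence, which for BERT-style tokenizers is the `[CLS]` token. The sketch below mimics that selection with nested Python lists instead of tensors; the values and the helper names are made up for illustration.

```python
# Toy tensor as nested lists: batch of 2 sequences, 3 tokens each, hidden size 4.
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]


def first_token_states(hidden):
    """Keep only position 0 of every sequence: [batch, seq, hidden] -> [batch, hidden]."""
    return [sequence[0] for sequence in hidden]


def shape(nested):
    """Shape of a regular nested list, e.g. (2, 3, 4)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


cls_states = first_token_states(last_hidden_state)
print(shape(last_hidden_state), "->", shape(cls_states))
```

The result is one fixed-size vector per input text, which is the kind of feature matrix a classical classifier expects.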

In [30]:
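def extract_hidden_state(batch):
    # Keep only the tensors the model expects and move them to the model's device
    inputs = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    # Inference only, so gradient tracking is disabled
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return the hidden state of the first token ([CLS]) as a NumPy array
    return {'hidden_state': last_hidden_state[:, 0].cpu().numpy()}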

In [31]:
emotion_encoded.set_format('torch', columns=['input_ids','attention_mask','label'])

In [None]:
emotion_hidden = emotion_encoded.map(extract_hidden_state,batched=True)

In [None]:
# Print both lists; a notebook cell only displays the value of its last expression
print(emotion_encoded['train'].column_names)
print(emotion_hidden['train'].column_names)

In [None]:
import numpy as np

x_train = np.array(emotion_hidden['train']['hidden_state'])
y_train = np.array(emotion_hidden['train']['label'])
x_valid = np.array(emotion_hidden['validation']['hidden_state'])
y_valid = np.array(emotion_hidden['validation']['label'])
x_train.shape, y_train.shape


In [None]:
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(x_train,y_train)
lr_clf.score(x_valid,y_valid)

In [None]:
lr_clf.score(x_valid,y_valid)
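
For a classifier, `score` returns the mean accuracy on the given data. Because the classes are imbalanced, it may help to compare that number against a trivial baseline that always predicts the most frequent training label. The following is a small sketch with toy integer labels; the function names are hypothetical and the numbers say nothing about the real dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_train, y_eval):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return accuracy(y_eval, [majority_label] * len(y_eval))


# Toy integer labels for illustration only.
y_train_toy = [1, 1, 1, 0, 0, 3]
y_valid_toy = [1, 0, 1, 3]
print(accuracy(y_valid_toy, [1, 0, 0, 3]))
print(majority_baseline_accuracy(y_train_toy, y_valid_toy))
```

A trained model is only informative to the extent that its accuracy clearly exceeds this baseline.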

In [None]:
text = ['This course has arrived to its end. Have a nice weekend!']
tokenized_text = tokenizer(text, return_tensors='pt')
hidden_state = extract_hidden_state(tokenized_text)
prediction = lr_clf.predict(hidden_state['hidden_state'])
prediction