my-document-classifier

This model is a fine-tuned version of distilbert-base-cased on a Kaggle text classification dataset (see Training and evaluation data below). It achieves the following results on the evaluation set:

  • Loss: 0.0313
  • Accuracy: 0.9910
  • F1: 0.9910

Model description

3.0 Evaluation
To evaluate our model, it is crucial to look at the metrics obtained at each epoch. As stated in Section 2:

Accuracy & F1 metrics increase in epoch 1,2 and 3 but suddenly decrease in epoch 4. Furthermore, the training loss does decrease as the models cycle through more epochs, however, this happens in an almost exponential manner, decreasing as the number of epoch increases as in Figure 2.1.3. In addition, the validation loss decreases in epoch 1,2 and 3 but suddenly increases at epoch 4 and starts to fluctuate up and down, but primarily increasing in further epochs.

Generally, a lower training loss is a good indicator, so a training loss that keeps decreasing in further epochs can be considered a positive. The results from the training and testing of our model are undeniably good; however, in our case the validation loss actually increases and fluctuates while the training loss keeps shrinking towards zero. This divergence is a clear sign of overfitting (Goodfellow et al., 2016): the model is learning our specific training set too well, to the point that it performs poorly on unseen data (the test set) (Chollet, 2021). Therefore, overfitting is a serious problem in our model. To counter this issue, from here onwards we use the 3rd checkpoint pushed to rngrye/my-document-classifier, as the metrics are most stable at epoch 3.
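For reference, the sketch below shows one way to load the published model from the Hub and obtain per-label scores with the Transformers pipeline. It assumes the repository's default revision corresponds to the epoch-3 checkpoint chosen above; a specific checkpoint commit could otherwise be pinned with the pipeline's `revision` argument. The example sentence is illustrative only.

```python
# Minimal sketch: load the published classifier and score a text.
# Assumes the repo's default revision is the epoch-3 checkpoint referred to above.
from transformers import pipeline

classifier = pipeline("text-classification", model="rngrye/my-document-classifier")

# top_k=None returns a score for every label rather than only the best one.
print(classifier("The championship final went to extra time last night.", top_k=None))
```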

Aside from the overfitting issue, the epoch-3 model's performance metrics of 0.982063 (Accuracy) and 0.981928 (F1) are very good, indicating that the model performs well on the test set and is able to classify texts into the 5 labels accurately. Additionally, its training loss of 0.0315 is relatively low and its validation loss of 0.05465 is the lowest among all the epochs, both of which show strong results on the training and test datasets.

3.3 Texts with ambiguous themes
Overall, the model only struggles with texts that have ambiguous themes, where it is unable to give the other themes present in the text a high score. The main thing to keep in mind is that our model was built for single-class classification. Hence, the percentages shown are not the model saying the text is 20% Sport and 80% Business; rather, the model is saying "I am 80% sure the text is Business and only 20% sure it is Sport". Nevertheless, this is still an issue and is discussed further in Section 4.0. Beyond the ambiguous cases, we prepared 5 Word documents, one for each theme, to be passed into our model through Gradio to test the classification feature. These documents are unambiguous to the human reader, and the results from these tests are what Section 4.2 refers to when noting that the model handles clearly defined themes well.
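The Gradio interface itself is not bundled with this card. Purely as an illustration, here is a minimal, hypothetical sketch of a file-upload classifier of the kind described in Section 4.2, assuming plain-text (.txt) uploads; handling the Word documents mentioned above would additionally require a .docx parser such as python-docx.

```python
# Hypothetical sketch of a Gradio file-upload classifier (not the project's actual app).
# Assumes uploaded files are plain text; .docx files would need an extra parsing step.
import gradio as gr
from transformers import pipeline

classifier = pipeline("text-classification", model="rngrye/my-document-classifier")

def classify_file(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    scores = classifier(text, truncation=True, top_k=None)
    if scores and isinstance(scores[0], list):   # some Transformers versions nest the output
        scores = scores[0]
    # gr.Label expects a {label: confidence} mapping.
    return {item["label"]: float(item["score"]) for item in scores}

demo = gr.Interface(
    fn=classify_file,
    inputs=gr.File(label="Upload a document", type="filepath"),
    outputs=gr.Label(num_top_classes=5),   # shows the model's confidence per theme
)

if __name__ == "__main__":
    demo.launch()
```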

Intended uses & limitations

4.1 Limitations
The time period given to execute our project was relatively short. The outcome was still satisfying and the project's aim was achieved, but the limited time meant we could not find a dataset with more than one label per entry, which would have resolved the ambiguity issue.

4.2 Strengths
The model performs brilliantly on texts with clearly defined themes, as shown by the results in Section 3.3. Additionally, our interface allows users to drag and drop or upload files directly from their computer, so there is no need to open each file and copy and paste its text into the interface, which significantly improves the usability of the project. On top of that, a display shows how confident the model is in its prediction alongside the scores for the other classes. This can help users notice ambiguity in certain texts even if it does not resolve it.

4.3 Weaknesses
As mentioned before, the main issue is classifying texts with ambiguity. Some of the reasons are:

Training Data Reflection: This issue seems to come from how the model was trained. If most of the training documents had only one clear topic, then the model never really learned how to handle documents with mixed topics. It is essentially doing what it was trained to do, which is to pick the best single label; models trained primarily on single-topic data are unlikely to perform well on documents containing multiple overlapping themes (Zhang et al., 2018).

Single Label Limitation: Right now, the model is set up to predict only one label per document, and it struggles when it is not sure which label to pick, giving a high score to one label while almost ignoring the others. For example, if a news article talks about both sports and politics, the model can only choose one and might miss or ignore the other. This is a known limitation of standard text classification models that are not designed for multi-label outputs (Scikit-learn, n.d.); a hypothetical sketch of what a multi-label setup could look like is shown after Section 4.4. In addition, our interface and model implementation were also very basic: the model can only read text, and in this era most documents do not contain only text, so pictures or images can make a huge difference to what a document's theme is.

4.4 Suitability for Real World Use
To use this model in the real world there would need to be a setting in which documents must be classified into Sports, Politics, Technology, Economy or Entertainment, or an individual or corporation that wants to do exactly that, but that is very unlikely.
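As promised above, here is a purely hypothetical sketch (nothing like this was trained in this project) of the standard Transformers remedy for the single-label limitation: configuring the head with problem_type="multi_label_classification", which replaces the shared softmax over the 5 themes with independent sigmoid outputs so a document can score highly on several themes at once.

```python
# Hypothetical sketch only: a multi-label variant was NOT part of this project.
# problem_type="multi_label_classification" switches the loss to BCEWithLogitsLoss,
# so each theme receives an independent probability instead of sharing a softmax.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased",
    num_labels=5,
    problem_type="multi_label_classification",
)

inputs = tokenizer("The minister criticised the football federation's budget.", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)

predicted = (probs > 0.5).int()   # a document may now activate several labels at once
```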

However, we have built and are now familiar with a framework for document classification, so if in the future we are tasked, as employees, with classifying company documents into any given set of classes and are given ample time, this is very realistic. Furthermore, I am confident that there are datasets out there with very practical labels that could be used in business or educational settings, where classifying documents automatically would save workers precious time; time is simply needed to find suitable and usable ones.

Training and evaluation data

1.1 Data Preprocessing
The dataset, sourced from Kaggle's text classification dataset, consists of documents labeled with single topics. Text data is tokenized using DistilBERT's tokenizer to convert raw text into input features compatible with the model. Data is split into training and validation sets to evaluate performance during fine-tuning.
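A minimal sketch of this preprocessing step, assuming the Kaggle data has been exported to a CSV with "text" and "label" columns; the file name, column names, and split ratio are assumptions, as they are not recorded in this card.

```python
# Sketch of the preprocessing described above; file name, columns and split size are assumed.
from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("csv", data_files="documents.csv")["train"]   # hypothetical file name
dataset = raw.train_test_split(test_size=0.2, seed=22002423)     # train / validation split

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def tokenize(batch):
    # Convert raw text into DistilBERT input IDs and attention masks.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True)
```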

Training procedure

1.2 Model Selection and Fine-Tuning
Base Model: DistilBERT, a lightweight variant of BERT, is chosen for its efficiency and performance in NLP tasks.
Fine-Tuning: The model is fine-tuned on the labeled dataset using PyTorch and the Hugging Face Transformers library. Training involves adjusting hyperparameters to optimize performance.
Checkpoints: Model snapshots are saved at each epoch to track performance and mitigate overfitting.

1.3 Evaluation Metrics
Performance is measured using accuracy, F1 score, training loss, and validation loss. Confusion matrices are generated to analyze classification results across epochs.
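A condensed sketch of this fine-tuning and evaluation loop, building on the `tokenized` splits from the preprocessing sketch above and the `training_args` defined in the hyperparameters sketch below; the weighted F1 averaging is an assumption, as this card does not state which averaging was used.

```python
# Sketch of the fine-tuning loop; `tokenized` and `training_args` come from the
# neighbouring sketches. Per-epoch checkpointing is handled by save_strategy="epoch".
import numpy as np
import evaluate
from transformers import AutoModelForSequenceClassification, Trainer

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    # Accuracy and F1 on the validation split, computed after every epoch.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="weighted")["f1"],
    }

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased", num_labels=5
)

trainer = Trainer(
    model=model,
    args=training_args,                 # see the hyperparameters sketch below
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```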

Training hyperparameters

The following hyperparameters were used during training (a TrainingArguments sketch mirroring them follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 22002423
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 10
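A sketch of how the values above map onto `TrainingArguments`; the output directory and the per-epoch evaluation/checkpointing strategy are assumptions consistent with the per-epoch results reported below, and anything not listed is left at its library default.

```python
# TrainingArguments mirroring the hyperparameters listed above; unlisted values are defaults.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="my-document-classifier",   # assumed output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=22002423,
    optim="adamw_torch",                   # AdamW with betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    num_train_epochs=10,
    eval_strategy="epoch",                 # evaluate ...
    save_strategy="epoch",                 # ... and save a checkpoint after every epoch
)
```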

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
|---------------|-------|------|-----------------|----------|--------|
| 0.5447        | 1.0   | 112  | 0.0777          | 0.9821   | 0.9819 |
| 0.0752        | 2.0   | 224  | 0.0958          | 0.9731   | 0.9730 |
| 0.038         | 3.0   | 336  | 0.0711          | 0.9865   | 0.9865 |
| 0.0191        | 4.0   | 448  | 0.0795          | 0.9865   | 0.9865 |
| 0.0066        | 5.0   | 560  | 0.0900          | 0.9865   | 0.9865 |
| 0.0063        | 6.0   | 672  | 0.0945          | 0.9865   | 0.9865 |
| 0.0014        | 7.0   | 784  | 0.1040          | 0.9865   | 0.9865 |
| 0.0011        | 8.0   | 896  | 0.1023          | 0.9865   | 0.9865 |
| 0.001         | 9.0   | 1008 | 0.1027          | 0.9865   | 0.9865 |
| 0.0009        | 10.0  | 1120 | 0.1026          | 0.9865   | 0.9865 |

Framework versions

  • Transformers 4.52.4
  • Pytorch 2.6.0+cu124
  • Datasets 3.6.0
  • Tokenizers 0.21.1