FastText ile Türkçe Metinlerin Sınıflandırılması

In the field of Natural Language Processing (NLP), text classification refers to the process of dividing text documents into certain categories or classes. For example, tasks such as determining whether an email is spam or labeling a customer review as positive or negative are examples of such classifications. The texts we try to classify in this article will be scanned Turkish documents (land title document, identity document, etc.) and in this article, we will examine FastText, a powerful and fast tool to perform text classification tasks. Figure 1 shows sample documents used.

Figure 1 Document Examples

What is FastText?

FastText is an open-source text classification and word embedding tool developed by Facebook for use in natural language processing (NLP). Primarily, it is employed to process, classify, and generate word representations for text data. Due to its speed and effectiveness, it is widely utilized by researchers and developers.

Advantages of FastText

1. Fast Training and Predictions: It can efficiently process large datasets and make predictions quickly, which is crucial for large-scale text classification projects.
2. Word Embedding: FastText captures the semantic representation of each word in the text, enabling it to learn the meaning and similarities of words. It is particularly suitable for tasks requiring language analysis at the word level.
3. Multi-Language Support: FastText can work with many different languages and create custom word embedding matrices for these languages. Pre-trained models are available for 157 different languages, trained on Common Crawl and Wikipedia content.

How to Classify Text with FastText?

Step 1: Data Collection and Preparation

The first step is to gather data and process it. Training data should include text documents and their corresponding class labels. Data cleaning and feature extraction are also done at this stage. If your data is already in a machine-readable text format, preprocessing involves removing single-character strings, numbers, punctuation marks, and excess spaces, converting all words to lowercase, and correcting misspelled words. Unfortunately, FastText does not have the ability to perform data preprocessing. We carried out these tasks using different libraries. Correcting misspelled words is particularly important for semantic accuracy. For this task, we used the TurkishNLP library. The code snippet below demonstrates the steps and results for correcting a sample misspelled sentence.

from turkishnlp import detector
obj = detector.TurkishNLP()
obj.download()
obj.create_word_set()
lwords = obj.list_words("vri kümsi idrae edre ancaka daha güezl oalbilir")
corrected_words = obj.auto_correct(lwords)
corrected_string = " ".join(corrected_words)
print(corrected_string)

"veri kümesi idare eder ancak daha güzel olabilir"

However, if the data is in scanned documents, i.e., in image format, the text in the images needs to be converted into a machine-readable format. One of the most common methods used to convert images to text is through Optical Character Recognition (OCR) applications. Tesseract OCR, EasyOCR, and GOCR are among the most popular OCR libraries. In this project, we used the EasyOCR library to obtain text from images.

Once we have obtained the text using OCR, we can proceed to preprocess, clean, and train the document classification model.

Step 2: Converting Data to FastText Format

FastText uses a specific data format. You need to convert your training data to this format. Each line should contain a text document and the corresponding class labels for that document. Class labels should be in the form "__example-label__". An example for this project in this format could be given as follows:

__tapu__ Tapu Document Content

__kimlik__ Kimlik Document Content

Step 3: Training the FastText Model

To train the FastText model, we use the transformed training data in the required format. This includes a training dataset containing text documents and their class labels (e.g., deed, identity, etc.). FastText processes the data, learns the word embedding matrix, and creates a classification model. A simplified overview of how the FastText training model works is as follows:

Word Embedding: FastText starts by learning the word embedding representations of each word in the training data. Word embedding represents the semantic meaning of words and is learned using continuous bag-of-words (CBOW) or skip-gram techniques. FastText uses these embedding representations as a fundamental component for text representation.
Model Selection: FastText supports two types of models: supervised and unsupervised. The supervised model is used to classify labeled data, while the unsupervised model is used to find similarities between texts.
Linear Classification: In supervised model training, words and labels are represented with vectors. It learns these vector representations to ensure that words and their corresponding labels have similar vector representations. The learning process involves determining how close the vector representation of a word is to the vector representation of the relevant label.
High Hierarchical Classifier: FastText uses a high (hierarchical classifier) to manage complex relationships between classes and accelerate large classification tasks. This is achieved by placing classes in a hierarchical tree structure. Each node represents the probability of the corresponding class. Labels are associated with specific nodes in this tree structure.

Below is an example training code. Training is performed using the train_supervised function. In this classification application, the following parameters used by the function have been modified: "input" refers to the location of the text file containing the training data, "learning rate" determines how fast or slow a model will learn, "epoch" represents a training period where the entire training dataset is processed once by the model and weights are updated, "dim" is the length of the vectors we want to obtain. After the model is trained and saved, it can be loaded and used.

import fasttext
model = fasttext.train_supervised(input=”trainin-text-path”, lr=1.0, epoch=25, dim=50)
model.save_model(“my_fasttext_model.bin”)

Step 4: Model Evaluation and Adjustment

After training, we need to evaluate the model with test data. We can use metrics such as accuracy, F-1 score, etc., to measure the performance of the model. Based on the obtained results, we can adjust hyperparameters to improve the model's performance.

model = fasttext.load_model(“my_fasttext_model.bin”)
model.predict(“test_metni”)

When the model is loaded and the class of the text is predicted, the result obtained looks like this, with classification categories and similarity scores:

((u’’__label__tapu, u’__label__kimlik’ …), array([0.939923, 0.592677 …]))

Conclusion

FastText is a powerful tool for efficiently performing text classification tasks. With its capabilities for fast training and quick prediction, it is possible to achieve successful results even when working with large datasets. Therefore, it is recommended to consider FastText for your text classification projects.

To learn more about FastText and see application examples, you can review its official documentation and sample projects. Best of luck!

Turkish Text Classification with FastText