Commencis Unveils Turkish Banking-Focused Language Model: Introducing Commencis LLM

After choosing Mistral 7B as the base model for Commencis LLM, the team spent approximately three months fine-tuning the large language model on customized datasets.

In a blog post, Commencis introduced its banking and finance-focused large language model, which is fluent in Turkish. Before building the model, the team tried leading models such as Llama 2, Mistral, Mixtral, Zephyr and OpenChat 3.5, and chose to proceed with Mistral 7B as the base model for Commencis LLM. Mistral 7B was chosen for its proven ability to handle complex datasets and specialized terminology.

High-performance GPU instances from Amazon Web Services were used for the project. The Commencis team used the g5.2xlarge and g5.48xlarge instance types: g5.2xlarge supported the project's significant resource needs, while g5.48xlarge was used for the most resource-intensive operations.

According to the company, the engineering team at Commencis focused on fine-tuning the large language model for about three months. During this period, the team worked to improve the model’s understanding of Turkish and its ability to capture semantic relationships more accurately than ever before.

Data sets used in the training process of the model

During the development of the large language model, the Commencis team also focused on creating a customized dataset designed to strengthen the model's Turkish capability, especially for the banking and finance sector. This process involved extensive collection of various types of data, including customer service records, financial reports, market analyses, and legal and regulatory documents. The aim was for the model to grasp industry-specific jargon, terminology and expressions.

Next, a data cleaning and curation phase was carried out in which low-quality and irrelevant data were removed. The team placed emphasis on balancing data from various language structures to ensure linguistic diversity and increase the model's processing capacity.
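The article does not describe how the cleaning step was implemented. As a minimal sketch of what removing low-quality and duplicate records can look like, assuming the records are plain strings, the hypothetical helper below drops very short entries and case-insensitive duplicates. The function name `clean_corpus` and the `min_words` threshold are illustrative assumptions, not details from the Commencis post.

```python
import re


def clean_corpus(records, min_words=5):
    """Drop very short and duplicate records (illustrative thresholds only)."""
    seen = set()
    kept = []
    for text in records:
        # Collapse runs of whitespace so formatting differences do not hide duplicates.
        normalized = re.sub(r"\s+", " ", text).strip()
        # Treat records with too few words as low quality.
        if len(normalized.split()) < min_words:
            continue
        # Compare case-insensitively to catch duplicates that differ only in casing.
        key = normalized.casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append(normalized)
    return kept
```

A real pipeline would likely add relevance and language checks on top of this; the sketch only covers the simplest filters.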

Factors such as gender, ethnicity, and geographic location were also taken into account to reduce bias and increase the diversity of responses.

In its post, the team also states that the banking industry faces a scarcity of Turkish datasets for pre-trained open source large language models. To address this, the team launched a strategic initiative that aimed not only to collect and improve existing data, but also to create new datasets that could provide a deeper understanding of Turkish and of banking-related terminology.

The Commencis team used OpenAI's GPT-4 services to generate thousands of synthetic instructions suitable for supervised fine-tuning, drawing on terms from the banking and finance lexicon and on the relationships between defined categories and subcategories. These synthetic instructions played a crucial role in fine-tuning the models, and, according to the team, they contributed to setting a new standard for the preparation and testing stages.
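The post does not publish the prompts or the taxonomy that were used. As a rough sketch of how category, subcategory and lexicon terms can be combined into instruction-generation prompts, the example below builds one prompt per subcategory and term pair. The taxonomy, the term list and the prompt wording are all hypothetical, and the call to the language model itself is deliberately left out.

```python
from itertools import product

# Hypothetical taxonomy and lexicon; the article does not list the real ones.
TAXONOMY = {
    "Loans": ["Mortgage", "Consumer loan"],
    "Cards": ["Credit card limit"],
}
TERMS = ["interest rate", "installment"]


def build_prompts(taxonomy, terms):
    """Return one generation prompt per (subcategory, term) pair."""
    prompts = []
    for category, subcategories in taxonomy.items():
        for subcategory, term in product(subcategories, terms):
            prompts.append(
                "Write a Turkish customer question and a correct answer about "
                f"'{subcategory}' ({category}) that uses the term '{term}'."
            )
    return prompts
```

Each prompt would then be sent to the generating model, and the returned question and answer pairs would become candidate fine-tuning examples.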

The team also used a special filtering method to improve the quality of Q&A interactions. In addition, the Turkish proficiency, accuracy, clarity and coverage of the data were comprehensively evaluated using GPT-4.

Commencis also includes comparisons with other models in its blog post. According to evaluation criteria inspired by the paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" by Lianmin Zheng and colleagues, Commencis LLM outperforms well-known large language models in many areas.

The criteria are Turkish language proficiency, relevance, accuracy, precision and completeness of answers. These metrics serve as the cornerstone for making necessary adjustments to the datasets and fine-tuning parameters.
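How the scores were collected and aggregated is not described in the article. As a minimal sketch, assuming a judge model returns a numeric score per criterion for each answer (the 1 to 10 scale here is an assumption borrowed from MT-Bench-style grading), per-criterion averages could be computed as follows. The key names and the function `average_scores` are hypothetical.

```python
from statistics import mean

# The five criteria named in the article, as hypothetical dictionary keys.
CRITERIA = (
    "turkish_proficiency",
    "relevance",
    "accuracy",
    "precision",
    "completeness",
)


def average_scores(judgments):
    """Average each criterion over a non-empty list of per-answer score dicts."""
    return {c: mean(j[c] for j in judgments) for c in CRITERIA}
```

Comparing these averages between models, or between fine-tuning runs, is one way such metrics can guide adjustments to the data and parameters.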

According to the company, customized, in-house deployment of proprietary models such as Commencis LLM ensures compliance and data control within the sensitive framework of the banking and finance industry.

Source: Webrazzi / Prepared by Irem Yildiz
