This article is an English translation of an article written by the author in Japanese at pickerlab.net.
The rapid pace at which large language models (LLMs) and generative AI tools such as ChatGPT, Gemini, and Claude are being released has created what can only be described as a competitive arms race in AI development. As a Japanese professional working in data science and natural language processing (NLP), I often find myself discussing these advancements with clients during casual conversations. Inevitably, the question arises: “How do domestically developed Japanese LLMs compare?”
To put it bluntly, my perspective—and one that I believe is shared by many others deeply involved in AI and NLP—is that it is virtually impossible for Japanese-developed models to compete on a global stage. At least, that’s my honest assessment.
To provide some context, back in the era when BERT was the dominant language model, international models often struggled with Japanese language comprehension. This left room for Japanese developers to create language models specifically tailored for Japanese, which had practical value within the domestic market. However, with the release of models like GPT-3.5, the situation changed dramatically. These international models demonstrated an astonishingly high level of Japanese language understanding, leading many experts, myself included, to feel that the role of Japanese developers in LLM development has significantly diminished.
The underlying reason for this shift lies in the way LLMs process language. These models convert natural language—whether it’s Japanese, English, or any other language—into numerical vector representations. Once the text has been accurately encoded as vectors, the processing that follows is largely language-agnostic. In other words, the distinction between languages becomes irrelevant at the computational level.
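As a minimal illustration (using Hugging Face's transformers library with a common multilingual tokenizer; any similar tokenizer would make the same point), Japanese and English sentences are both reduced to the same kind of integer token IDs before the model processes them:

```python
from transformers import AutoTokenizer

# Any multilingual tokenizer illustrates the point; this is a common choice.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Both calls produce a plain list of integer token IDs.
print(tok("機械学習は楽しい")["input_ids"])
print(tok("Machine learning is fun")["input_ids"])

# Downstream, the model maps these IDs to vectors, and the computation
# that follows is identical regardless of the source language.
```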
As a Japanese professional observing these developments, it’s clear to me that the technological gap between domestic and international players in AI has widened.
The Reality of Domestic LLMs in Japan
Developing large language models (LLMs) requires an immense investment, with costs reportedly ranging from tens to hundreds of billions of yen, and daily operational costs nearing 100 million yen. Frankly, no Japanese IT company has the financial capacity to sustain this level of investment. Even the largest IT firms in Japan would quickly fall into the red if they attempted it (and it’s worth noting that even OpenAI operates at a staggering deficit).
As a result, the general consensus among professionals in the AI and data science industries is to avoid engaging with domestic LLMs and not to place any expectations on them.
That said, for users who have recently taken an interest in AI due to the rise of generative AI, it might seem natural to wonder if Japanese-made AI would be better suited for use in Japan. Personally, however, I find it troublesome when such expectations are placed on me. When asked, “What about domestic LLMs?” during work discussions, I usually respond immediately with, “There’s no need to consider them.”
At the same time, I’ve started to feel that dismissing domestic LLMs outright without even trying them is somewhat unprofessional. As someone who is paid for my expertise, I shouldn’t pass judgment without direct experience.
With that in mind, I decided to conduct a very simple test of “Tsuzumi,” a domestically developed LLM from NTT Data that is accessible through the Azure OpenAI Service.
Testing Tsuzumi 7B, GPT-3.5 Turbo, GPT-4, and GPT-4o with 13 Questions
As a benchmark, I conducted a simple accuracy test using 13 questions I selected from various sources, including Japan’s university entrance exam (the Center Test), an employment aptitude test (the SPI), and general knowledge questions on economics and law.
Using a scoring system where a correct answer earns 1 point, a partially correct answer earns 0.5 points, and an incorrect answer earns 0 points, I calculated the percentage of correct answers for each model. A perfect score of 13/13 would correspond to 100%.
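In code, the scoring comes down to a trivial calculation; the single-partial-answer example here matches how a 0.5-point score becomes roughly 4% in the results below:

```python
def accuracy(points: list[float], total_questions: int = 13) -> float:
    # 1 point per correct answer, 0.5 per partially correct, 0 per incorrect.
    return 100 * sum(points) / total_questions

# A single partially correct answer out of 13 questions:
print(f"{accuracy([0.5]):.1f}%")  # 3.8%, i.e. roughly 4%
```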
Question Selection Criteria
The questions were chosen entirely at my discretion and designed to be challenging, particularly for LLMs. They included:
- 4 general knowledge questions (economics and law)
- 3 math questions from Japan’s university entrance exams (Center Test/University Common Test)
- 6 reading comprehension questions from employment aptitude tests (SPI)
The difficulty level was intentionally set so that GPT-3.5 Turbo would struggle, while GPT-4 might have a chance of achieving full marks.
Example Question
To give an idea of the type of problems used, here’s an example math question:
A theater group’s total number of members decreased by 40% from last year, leaving 480 members this year. By gender, the number of women decreased by 25%, while the number of men decreased by 62.5%. Calculate the number of women in the theater group this year. (Round to the nearest whole number if necessary.)
The correct answer is 360 women.
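For readers who want to check the arithmetic, here is a quick sanity check of that answer in Python:

```python
# Sanity check of the expected answer (360 women this year).
total_this_year = 480
total_last_year = total_this_year / (1 - 0.40)   # 800 members last year

# Let w = women last year, so men last year = 800 - w.
# Women kept 75%, men kept 37.5%, and together they total 480:
#   0.75*w + 0.375*(800 - w) = 480  ->  0.375*w = 180  ->  w = 480
women_last_year = (480 - 0.375 * total_last_year) / (0.75 - 0.375)

women_this_year = 0.75 * women_last_year
print(women_this_year)  # 360.0
```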
This kind of calculation question is representative of the math problems included in the test. Results for each model will be discussed in the next section.
As for the reading comprehension questions, they included typical problems like selecting the correct conjunction to fill in a blank or choosing a sentence that does not contradict the target passage. These are common in the Japanese Center Test and similar exams.
Tsuzumi performed worse than GPT-3.5 Turbo.
The results showed GPT-4o achieving a 77% accuracy rate, GPT-4 at 53%, GPT-3.5 Turbo at 12%, and Tsuzumi at just 4%, placing Tsuzumi below even GPT-3.5 Turbo.
Tsuzumi managed to score only 0.5 points by partially answering just one knowledge-based question. It struggled with reading comprehension questions, often failing to understand the instructions, making it impractical for use.
Additionally, Tsuzumi’s performance worsened as the input prompt grew longer, suggesting that it’s not suitable for handling large volumes of text, such as in RAG (Retrieval-Augmented Generation) systems. Since my purpose was to evaluate whether Tsuzumi could be integrated into a RAG system, I concluded that it’s unlikely to be a viable option.
Even with inputs of about 1,000 characters, Tsuzumi felt inadequate; GPT-4o, by comparison, can handle and understand texts as long as 10,000 characters with far greater accuracy.
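To make the RAG concern concrete, here is a hypothetical sketch of how a RAG prompt is typically assembled; `search` and `llm` stand in for whatever retriever and model client you happen to use:

```python
from typing import Callable, List

def answer_with_rag(
    question: str,
    search: Callable[..., List[str]],  # hypothetical retriever
    llm: Callable[[str], str],         # hypothetical model client
    k: int = 5,
) -> str:
    passages = search(question, top_k=k)   # fetch the k most relevant passages
    context = "\n\n".join(passages)        # easily thousands of characters long
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # The model must stay accurate over this entire long input,
    # which is exactly where Tsuzumi degraded in my testing.
    return llm(prompt)
```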
The results are understandable given the model’s scale.
To be honest, the performance of an LLM largely depends on two factors: the size of the training dataset and the number of parameters in the model. The parameters are the learned weights connecting the “nodes” of the model’s neural network, and the dataset represents the amount of information the model has “studied.” Simply put, a model with more parameters and a larger dataset will naturally perform better.
Of course, training methods and model architecture also matter. However, the major overseas models are built by highly skilled engineers commanding top-tier salaries, so I assume their design and training processes are top-notch.
Creating datasets is also costly, and the computational resources required to train a model grow steeply with the number of parameters and the amount of training data. This significantly drives up development costs.
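As a rough, back-of-the-envelope illustration, one widely cited approximation puts training compute at about 6 FLOPs per parameter per training token (the token counts below are illustrative guesses, not reported figures):

```python
def train_flops(params: float, tokens: float) -> float:
    # Common rule of thumb: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

print(f"{train_flops(7e9, 1e12):.1e}")   # 7B-parameter model, 1T tokens  -> ~4.2e+22
print(f"{train_flops(1e12, 1e13):.1e}")  # 1T-parameter model, 10T tokens -> ~6.0e+25
```

Even under these crude assumptions, the larger model demands roughly a thousand times more compute, which is the cost wall that smaller players run into.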
Tsuzumi, with its 7 billion parameters, is modest compared to recent models with over a trillion parameters. It seems to have been developed with a more constrained approach, likely avoiding the “arms race” of massive budgets. In that sense, achieving this level of performance with 7 billion parameters is impressive. For reference, GPT-3.5 Turbo is rumored to have hundreds of billions of parameters.
From a parameter perspective, Tsuzumi might be performing well. However, based on my experience using it, it doesn’t seem suitable for practical use cases.
In IT business applications, Japanese companies should focus on steady adoption of best practices rather than rushing to catch up.
This might be a side note, but Japan’s IT industry, which is several years behind global trends, doesn’t need to compete with overseas players. I believe domestic AI initiatives like Tsuzumi are not aiming to “win” but rather to gain insights from global leaders or use their development as a marketing tool.
For those of us in the IT field, this perspective may seem obvious, but I suspect many users might have a different view. There’s no need to be at the cutting edge. Instead, we can calmly observe overseas technologies and case studies, identify what needs to be done, and walk the well-paved paths that global pioneers have already struggled to create.
At this point, Japan is many laps behind, so there’s no point in trying to catch up. It’s enough to simply move forward at our own pace from where we currently stand.
That said, IT professionals are partly to blame for creating unrealistic expectations by using buzzwords like “cutting-edge” or “latest technology” as part of sales pitches. This is particularly evident in some of the commentary surrounding Tsuzumi. It’s important for Japan’s IT industry to maintain a humble attitude, appreciating the hard work and expertise of overseas leaders whose technologies and methods we are fortunate to utilize.
Still, thinking about the people involved in developing Tsuzumi leaves me with mixed feelings. It must have been a difficult challenge—delivering results within a limited budget in what felt like a losing battle. Perhaps the developers were purely driven by technical curiosity, which kept them motivated despite the odds.
AI projects often involve an overwhelming amount of uncertainty, requiring teams to push forward without clear answers about what’s meaningful or where the true value lies. It’s a mentally taxing process, almost like a form of spiritual training. I don’t know the details of how Tsuzumi was developed, but I wonder what the atmosphere was like in the development team.