Three Methods to Improve the Accuracy of Large Language Models (LLM)


The structure of this article is as follows:
  1. Why Large-Scale Language Models (ChatGPT) Make Errors in Search, Inference, and Computation
    1. Understanding how large-scale language models work explains why ChatGPT's inference (including search) is bound to make errors
    2. Three methods to improve the accuracy of large-scale language models (ChatGPT)
  2. Improving Accuracy through Scaling Up (Difficult for Users to Participate In)
    1. Using the GPT-4 model with ChatGPT
    2. Column: "Nullpo" and "Gah!"
    3. Limitations of Scaling Up
  3. The Main Method to Achieve 60% to 90% Accuracy: Prompt Engineering (Fine-Tuning)
    1. ChatGPT + GPT-4 vs. Bing
    2. Sample Project for Fine-Tuning
    3. Limitations of Prompt Engineering (Fine-Tuning)
  4. Method to Achieve 90% or Higher Accuracy: Combination with Existing Programs
    1. Utilizing the superior comprehension and text-refinement abilities of large-scale language models without relying on human skills
    2. Sample Project for Intercepting the Inference Part

Why Does the Large-Scale Language Model (ChatGPT) Make Mistakes in Search and Inference?

Even with simple knowledge queries, ChatGPT makes mistakes. One obvious reason is that the text ChatGPT was trained on is outdated, ending more than a year ago. Until it became able to integrate with web search (Microsoft's Bing), it could not even answer questions about today's weather.
However, ChatGPT also makes frequent errors about facts it should have learned before 2021. If you use it expecting something like Google search, the many inaccuracies it produces will cause confusion. For example, when asked about the 2018 本屋大賞 (Bookstore Award), which ChatGPT should have learned about, it displays a ranking completely different from the actual list (as of late June 2023).

ChatGPT's answer (excerpt):
2位:「波よ聞いてくれ」(イトウヒロミ 著)
3位:「氷酒村」(今井秀樹 著)

Actual 2018 本屋大賞 results:
大賞:「かがみの孤城」(辻村深月)
2位:「盤上の向日葵」(柚月裕子)
3位:「屍人荘の殺人」(今村昌弘)
The first place "おしりのこ(Oshirinoko)" and the third place "氷酒村(Hyoshumura)" are both of unknown origin, author names included. The second place "波よ聞いてくれ(Nami yo Kiite Kure)" actually ranked fourth in the 2020 Manga Taisho.
Since the entire top-10 ranking was essentially random, we asked again using the larger-scale GPT-4 model.

1位: 「かがみの孤城」 (辻村深月)
2位: 「コンビニ人間」 (村田沙耶香)
3位: 「九月が永遠に続けば」 (恩田陸)
1st place is correct. 2nd place actually placed 9th in 2017. 3rd place won the 2004 Horror Suspense Award, but its author is not 恩田陸 (Riku Onda) but 沼田まほかる (Mahokaru Numata). Why does ChatGPT make such mistakes? The reason becomes clear once you understand how ChatGPT is trained.

Understanding how large language models work makes it clear that ChatGPT is bound to make mistakes in inference (including search).

Using ChatGPT as a search engine often produces incorrect results, because ChatGPT does not search the internet when asked a question.
So what does ChatGPT actually do?
ChatGPT performs "completion": it treats the text as having blanks and fills them in. Essentially, it is the same as a "fill in the blanks" language exercise.
Complete the sentence with an appropriate word:
"I am a cat. I don't have a name yet. I have no idea where I was born. All I remember is crying '______' in a dark and damp place."
(The correct answer is "meow meow".)

Candidate completions and their estimated probabilities:
meow meow → 0.25
mya mya → 0.25

The user's "inquiry" is simply a fill-in-the-blank question whose "blank" is located at the end of the text.
ChatGPT does not search the internet in response to user inquiries, nor does it actually perform calculations for arithmetic problems in the input.
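As a toy illustration (not ChatGPT's actual algorithm), the "fill in the blank" step can be sketched as sampling from a probability distribution over candidate completions. The candidates echo the cat example above; a third made-up candidate is added so the probabilities sum to 1.

```python
import random

# Toy sketch: completion as sampling from a probability distribution over
# candidate tokens. The candidates and weights are illustrative only.
candidates = {"meow meow": 0.25, "mya mya": 0.25, "waah waah": 0.50}

def complete(blank_candidates):
    """Pick a completion for the blank, weighted by estimated probability."""
    tokens = list(blank_candidates)
    weights = list(blank_candidates.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(complete(candidates))  # prints one of the three candidates
```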

ChatGPT's training is conducted as follows:
  1. Prepare a large amount of (unblanked) text.
    Based on comparison with other large-scale language models, ChatGPT is estimated to have been trained on roughly 1000 GB of text data.
    The Japanese Wikipedia, the huge online encyclopedia familiar to every Japanese person, contains less than 4 GB of text data as of the first quarter of 2023.
    ChatGPT was therefore trained on more than 250 times the text of the Japanese Wikipedia.
    Incidentally, the Japanese Wikipedia's text is estimated to be several times the total amount of Japanese an average person hears and reads in a lifetime (estimated from average reading time, average media viewing time, and University of Maryland survey data on words spoken per day).
  2. Run inference with blanks placed at random and randomly initialized parameters.
    Blanks vary in size from a few tokens (token ≒ word) up to about 100 tokens, and may be concentrated in one place or scattered across several.
    The initial outputs merely reflect word frequencies and read as context-free text.
  3. Adjust the parameters (estimated at around 350 billion, based on GPT-3's 175 billion and PaLM's 540 billion) as numerical values between 0 and 1.0, to minimize the difference between the generated text and the original text.
  4. Repeating this process over a large amount of text improves accuracy on the fill-in-the-blank questions.
  5. Repeat the training until sufficient accuracy is achieved.
  6. Sufficient accuracy is then obtained even on fill-in-the-blank inputs outside the training text (this is known as generalization).
ChatGPT is the evolution of the GPT model (GPT stands for Generative Pre-trained Transformer). Using that sentence as training text:

Completion of 1 token:
ChatGPT is the ○○ of the GPT model,
(To train the model, "evolution" is left blank and the model learns to predict the blank token from the surrounding, non-blank tokens.)
ChatGPT is the evolution of the ○○ model,

Completion of 2 tokens:
ChatGPT is ○○ of the ○○ model,
ChatGPT ○○ the ○○ of the GPT model,

Completion of 3 or more tokens:
ChatGPT is the evolution of the GPT model, which ○○ for ○○ ○○ Transformer.
ChatGPT is the ○○ of the ○○ ○○, which stands ○○ ○○ ○○ ○○.
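The masking procedure above can be sketched as follows. The whitespace tokenizer and the helper name are illustrative simplifications, not how GPT models actually tokenize.

```python
import random

def mask_spans(text, n_blanks=1, max_span=3):
    """Blank out random token spans to create a fill-in-the-blank example."""
    tokens = text.split()          # toy tokenizer: whitespace split
    masked = tokens[:]
    targets = []
    for _ in range(n_blanks):
        span = random.randint(1, max_span)
        start = random.randrange(0, len(tokens) - span + 1)
        targets.append(tokens[start:start + span])  # what the model must predict
        for i in range(start, start + span):
            masked[i] = "____"
    return " ".join(masked), targets

print(mask_spans("ChatGPT is the evolution of the GPT model"))
```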
A brief explanation of the parameters being learned:
  1. Tokenize the text.
    Tokens are similar to words, though not strictly equal: the present-participle form "running", for example, can become a separate token.
  2. Represent each token as a vector of several hundred to several thousand dimensions, and learn the values of those vectors.
    With a simple 3-dimensional vector, "man" and "woman" might be represented as:
    man = [0.56, 0.01, 0.23]
    woman = [0.06, 0.61, 0.20]
    With only 2 learned words, the number of parameters is very small (probably under 100). The parameters include the input token vectors, the weights inside the deep-learning networks, and the connection weights between the networks and the output tokens.
  3. It has been observed that accuracy on the fill-in-the-blank task improves dramatically once the total number of parameters exceeds roughly 10 billion.
    GPT-3 learns with 175 billion parameters, ChatGPT with around 350 billion (estimated), and GPT-4 with 500 billion to several trillion (estimated; reports of 100 trillion are incorrect). Google's latest large-scale language model also exceeds 1 trillion parameters.
  4. Learn these hundreds of billions of parameters using a massive amount of text.
    GPT-3 was trained on roughly 1,000 GB of text; the Japanese Wikipedia is under 4 GB (as of 2023). In other words, hundreds of billions of parameters are learned from about 250 times the text of the Japanese Wikipedia.

Three Methods to Improve the Accuracy of the Large-Scale Language Model (ChatGPT): Scaling Up / Prompt Engineering / Integration with Existing Programs

The three methods are (1) scaling up the language model, (2) prompt engineering (fine-tuning), and (3) integration with existing programs.
Several other methods have been proposed but are not discussed in this article, either because no actual system had been developed as of July 2023, or because systems exist but have not been evaluated or have not reached the expected accuracy.
For example, language models focused on specific fields, such as medical papers or law and precedent, have been developed to improve accuracy. Some limited models of this kind are also available within ChatGPT.
  1. Accuracy improvement through scaling up (users can only select the largest available model)
    Increasing the size (parameter count) of a language model improves its accuracy on every task.
    Retraining parameters at a larger scale requires enormous computing power and electricity. Users are rarely involved in this; it is mainly done by the providers of language models (large companies, universities, research institutes).
    What users can do is select the larger model among the options provided.
  2. Prompt engineering (fine-tuning) is the key to achieving 60% to 90% accuracy
    Reconsider the "fill-in-the-blank" problems that large-scale language models are trained on.
    "Mother scolded Hanako, saying, '______!'"
    With only this information there are many possibilities, such as "Do it quickly," "Be quiet," or "Clean up," and the answer cannot be narrowed down.
    "Hanako didn't finish eating even after 30 minutes. Mother scolded Hanako, saying, '______!'"
    "Hanako is making noise while talking on the phone. Mother scolded Hanako, saying, '______!'"
    "It's mealtime, but Hanako left her toys out. Mother scolded Hanako, saying, '______!'"
    Prompt engineering means adding context and clues like these; with that contextual information, ChatGPT can solve the fill-in-the-blank problem more accurately.
    Accuracy also improves when important information is placed toward the end of the prompt. According to researchers at Stanford University, among others, putting important information at the beginning and the end of the input leads to better results.
    To keep ChatGPT from generating unconvincing answers, it helps to give instructions such as "answer in formal language" or "answer in a well-formatted way."
    Various application-specific "tips" are published in papers and on the internet; one example is the finding that GPT-detection software wrongly judged most essays written by non-native English speakers to be AI-generated.
    An even more effective method related to prompt engineering, fine-tuning on expert-written prompts, was proposed in the "Chain of Thought" paper (see here).
    Fine-tuning involves preparing pairs of a prompt and the ideal response text, on the order of tens of thousands to hundreds of thousands of pairs (the actual number depends on the difficulty of the problem), and retraining only a small portion of the layers of the large-scale language model.
    Because most of the parameters stay fixed and only a subset is retrained, fine-tuning improves the accuracy of a neural network at low computational cost.

    In the "Chain of Thought" research, it was found that answer accuracy improves significantly when the model is taught to spell out the thought process leading to the answer, instead of generating the answer directly from the user's input (question or request).
    Writing out the thought process requires a great deal of human effort, but over time the large-scale language model learns to supply the thought process itself.
    With about 10,000 problems annotated with the thought process leading to the answer, accuracy on arithmetic word problems improved from below 10% to the 30% range for the 175-billion-parameter GPT-3, and from the 30% range to the 50% range for the 540-billion-parameter PaLM.
  3. How to achieve 90% or higher accuracy: intercepting the inference and calculation part
    As explained in detail at the link above, this is a combination system that adds a linguistic interface, powered by a large-scale language model, to existing software.
    Since it is built on existing software, it can be considered a limited-purpose chatbot.
    Examples include issuing invoices and estimates, creating financial statements, and educational software that substitutes for a private tutor. In the medical and legal fields expert checks are currently essential, but tasks that can be entrusted to AI alone may emerge in the future. As an example, consider a mathematical processing system.
    As of May 2023, ChatGPT (equivalent to GPT-3.5) generates absurd answers as soon as a problem becomes slightly complicated:
    Prompt: What is the value of X in the equation 22-(4*5-3*X)=20?
    ChatGPT: Simplifying the equation, we have 22-20+15X=20, which becomes 2+15X=20. Solving for X, we get 15X=18, and X=1.2. Therefore, X is 1.2.
    Prompt: What is the value of X in the equation 22-(4*5-3*X)=20?
    ChatGPT: Simplifying the equation, we have 22-20+X=20, which becomes X=-2. Therefore, X is -2.
    To avoid this, we assume the use of existing mathematical software: software that solves equations (and systems of equations), converts units, and sometimes solves word problems.
    ChatGPT analyzes the problem statement and outputs formatted text suited to the mathematical software's input, such as a system of equations; this is the linguistic part, and may involve removing "How many?" or replacing "×" with "*". The mathematical software processes the equations, and its output ("X=6") along with the expansion steps is handed back to ChatGPT as a prompt to present to the user.
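A minimal sketch of the interception idea: the language model's job is reduced to producing a normalized equation string, and a conventional solver does the actual math. The solver below is a hypothetical stand-in for real mathematical software, handling only equations linear in X, and uses Python's eval for brevity (acceptable here because the input is the model's formatted output, not raw user text).

```python
def solve_linear(equation):
    """Solve an equation that is linear in X, e.g. '22-(4*5-3*X)=20'.

    A stand-in for real mathematical software: evaluate the left-hand side
    at X=0 and X=1, then use linearity to find the root.
    """
    lhs, rhs = equation.split("=")
    f = lambda x: eval(lhs, {"X": x})  # trusted, pre-formatted input only
    f0, f1 = f(0), f(1)
    return (float(rhs) - f0) / (f1 - f0)

# The equation ChatGPT (GPT-3.5) got wrong above:
print(solve_linear("22-(4*5-3*X)=20"))  # → 6.0
```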

1. Increasing Accuracy through Scaling (Difficult for Users)

To develop a language model with higher accuracy than ChatGPT, a cluster of hundreds to thousands of GPUs is required.
Currently, it is not possible to train a language model larger than ChatGPT on a local computer.
However, many large-scale language models smaller than ChatGPT exist, and some can be run on a local PC, for example "Llama 2", said to rival ChatGPT (GPT-3.5) (see "Running 'Llama 2' on a Local PC as a Competitor to ChatGPT (3.5)").
Unfortunately, its accuracy appears to be lower than ChatGPT's.
By limiting training to specific types of text, it has been observed that accuracy similar to or even surpassing ChatGPT can be achieved in the trained field.
However, response accuracy in untrained fields is very low.

There are also attempts to improve accuracy by limiting tokens to Japanese. However, based on papers on cross-lingual transfer learning in large-scale language models (training in German, then transferring to French) and mixed-language training (training on a mix of German and French tokens), multilingual training on tokens from multiple languages appears superior to single-language training, at least among similar languages.
For languages such as Japanese, Chinese, and Korean, which differ greatly from European languages, caution is still necessary regarding simultaneous multilingual training (as of 2019).

In the author's experience, providing more training data improves deep-learning accuracy far more effectively than adjusting parameters, algorithms, or the scale and configuration of the neural network. In the long run, models trained only on Japanese seem unlikely to catch up with models trained on Japanese plus English, Chinese, European languages, and other Asian languages combined.
Merely loading one trillion parameters into memory would require 4,000 GB, assuming 4 bytes per parameter. PCs running Windows or Linux are limited by hardware to at most 128 GB (in some cases 256 GB) of memory. Even with models that represent each parameter in 4 or 8 bits, sacrificing some accuracy, the required scale cannot be reached at the PC level.
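The memory arithmetic can be checked directly. The function name is ours, and the byte counts per parameter are the assumptions stated in the text (4 bytes for full precision, half a byte for 4-bit quantization).

```python
def weights_memory_gb(n_params, bytes_per_param):
    """Memory needed just to hold the weights (ignores activations, etc.)."""
    return n_params * bytes_per_param / 1e9

print(weights_memory_gb(1e12, 4))    # 1 trillion params at 4 bytes → 4000.0 GB
print(weights_memory_gb(1e12, 0.5))  # 4-bit quantization → 500.0 GB, still beyond a 128-256 GB PC
```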

To create larger language models, organizations need to invest in high-performance GPUs costing around 5 million yen each; 256 of them (including servers and facilities) amount to an infrastructure cost of around 3 billion yen, plus an electricity budget of over 10 million yen per month.

Using the NVIDIA® NVLink® Switch System, it is possible to connect up to 256 H100 units and accelerate exascale workloads. Furthermore, by using a dedicated Transformer Engine, it is possible to implement language models with parameters in the trillions.

ChatGPT offers a paid option providing a model called GPT-4, which has approximately 500 billion to several trillion parameters (estimated from similar large-scale language models; not officially disclosed). On the same inquiries, GPT-4's accuracy is clearly higher than standard ChatGPT's.

GPT-3, a language model with approximately 175 billion parameters (about half the parameter count of ChatGPT), scored in the bottom 10% on a US law examination, while GPT-4, a larger-scale model with approximately 500 billion to several trillion parameters, achieved a top-10% score, as discussed in the article "Is the latest and highest-performance large-scale language model GPT-4 really 'AI research'?".

Looking at the score progression of successive natural language processing models in the article "Score Transition of Historical Natural Language Processing Models",
it is evident that large-scale language models were gradually approaching, and in some cases surpassing, human scores even before ChatGPT appeared.

Microsoft's Bing and Google's Bard are also moving towards increasing the number of parameters, with Google's cutting-edge prototype model surpassing 1.5 trillion parameters.

Without prompt engineering or fine-tuning, GPT-3 (approximately 175 billion parameters) achieves below 10% accuracy on arithmetic word problems (single-variable arithmetic operations), while PaLM (approximately 540 billion parameters) reaches the 30% range. Combining prompt engineering adds roughly another 20-30 points: GPT-3 + prompt engineering reaches the 30% range, and PaLM + prompt engineering the 50% range.

Using the GPT-4 Model with ChatGPT

By subscribing to ChatGPT Plus, the paid plan of ChatGPT, you can use GPT-4. As of July 2023, the number of chats per hour using GPT-4 is limited; the limits are gradually being relaxed but remain fairly strict.
Normally, chats in ChatGPT are based on GPT-3.5. By selecting GPT-4 with the button on the screen, you can chat with GPT-4 instead.
Let's try it with a math word problem.

/* Incorrect with GPT-3.5 */
Number who passed arithmetic (28) + number who passed Japanese (25) -
 number who passed both subjects (17) = 36 students

/* Correct with GPT-4 */
42 - 36 = 6 students.


Column: "Nullpo" and "Gah!"

More than ten years ago, it became popular in some programmer communities to answer the message "ぬるぽ (nullpo)" with "ガッ (Gah!)". The meme comes from the shock of a null pointer, the cause or result of a bug, turning up as a value, but it is mostly used without any particular meaning. The result below symbolizes the difference between standard ChatGPT and ChatGPT + GPT-4, so it is introduced here as a column.
"Burst out laughing" at the result of sending "ぬるぽ" to an AI: ask ChatGPT "ぬるぽ" and it answers "ゲッ!" (instead of the expected "ガッ").

Limitations of Scaling Up

While scaling up improves accuracy on fill-in-the-blank problems, it has its limits, for the following reasons.
  1. The amount of text available for training is saturated.
    The amount of training text, one element of scaling up, is no longer growing. Training currently uses several TB of text; the Japanese Wikipedia is approximately 4 GB as of 2023, so training already uses several hundred times that. The amount of text accumulating on the web will presumably keep increasing, but the following major problem arises.
  2. Text generated by AI will increase rapidly in human society.
    It has been reported that adding AI-generated text to the training data degrades the model ("What happens when you keep training generative AI on 'text created by generative AI'? British team reports it becomes 'useless'"). This problem is critically serious for AI. When a large-scale language model learns from text generated by another large-scale language model, flaws are cumulatively amplified; with repeated training, the AI becomes progressively "stupider". The same phenomenon is observed in image-generation models. Unless texts created by humans can be marked and distinguished from those created by AI before this contamination progresses, increasing the amount of AI training text in the future will become hopeless.
    Because of the seriousness of this problem, it has become common practice for generative-AI developers to archive human-created texts and images that have not been contaminated by generative AI.
    Texts added for training after 2023 may be limited to the very few that are clearly recognizable as human-created.
  3. Computational cost and power consumption increase with parameter count.
    With current methods, even with parallel processing, a single output can take several tens of seconds, and as the parameter count grows, more GPUs are required for parallelization.
However, while scaling up the training text is difficult, scaling up the parameter count is comparatively easy, and increasing parameters is expected to remain the main focus.
Moreover, as a strategy for improving accuracy, prompt engineering has a greater impact than the benefits of scaling up.
Prompt engineering can also be done by users themselves and is expected to become a significant industry.

2. Prompt Engineering (Fine-tuning) is the key to achieving 60% to 90% accuracy

While scaling up must be left to the companies that develop language models, prompt engineering and fine-tuning can be done by users themselves, at essentially no cost.
Hiring skilled fine-tuning engineers may require high compensation, but that is negligible compared with the investment needed for scaling up.

As explained earlier, large language models learn through "fill-in-the-blank questions": they probabilistically generate text for the blank based on the surrounding, non-blank context. Then how do models like ChatGPT appear to hold conversations with users?
For a large language model like ChatGPT, a conversation is merely a "fill-in-the-blank question" in which the input sentence (the prompt) forms the first half and serves as the clue for filling in the blank in the second half.
Naturally, the more information the non-blank context contains, the better the fill-in accuracy; and the more specific the instructions in the context, the more reasonable the completion. For example, an instruction such as "within 100 characters" yields a concise answer, and "explained in a way that elementary school students can understand" makes the model solve arithmetic word problems without equations. As of July 2023, specific instructions are occasionally ignored, but only occasionally; this is expected to be resolved eventually.
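As a sketch of the difference, here is a bare prompt versus an engineered one. The wording is our own illustration of the instructions the text mentions (the age problem is one used later in this article), not a tested recipe.

```python
# Bare prompt: little context, no instructions.
bare = "In how many years will Misae be three times as old as Shinnosuke?"

# Engineered prompt: context first, then the question, then explicit
# instructions placed at the end, where important information helps most.
engineered = (
    "Currently Shinnosuke is 5 years old and Misae is 29 years old.\n"
    "In how many years will Misae be three times as old as Shinnosuke?\n"
    "Explain without using equations, in a way an elementary school student "
    "can understand, within 100 characters."
)

print(engineered)
```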

Answer accuracy improves depending on how prompts are written, so mastering the technique of writing prompts is important.
Those who have mastered it are sometimes called "prompt engineers", but acquiring the technique alone does not command a high salary.
A true "prompt engineer" is an engineer who can also write out, in text, the "thought process leading to an answer" in response to a prompt. Such a person plays a role similar to a teacher who can break the knowledge of a specialized field into steps and teach it to the AI, and must possess knowledge of both AI and the specialized field. In medicine, for example, a "prompt engineer" must be well-versed in medicine; in law, they must be an expert with extensive knowledge of statutes and judgments.
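The "thought process leading to an answer" can be sketched as a chain-of-thought exemplar placed in the prompt, in the style popularized by the "Chain of Thought" paper (the tennis-ball exemplar below is the paper's well-known example, lightly paraphrased):

```python
# A few-shot chain-of-thought prompt: the exemplar answer spells out the
# intermediate reasoning steps instead of just the final number.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:"""

print(cot_prompt)
```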

During the second AI boom before 1990, known for projects like the Fifth Generation Computer Systems project, there were "knowledge engineers" who built rule-based systems (expert systems) formalizing the knowledge of experts. They, too, were required to possess both rule-based AI knowledge and domain expertise. The prompt engineer can fairly be called a modern knowledge engineer.

As hybrids of technology and domain knowledge, they are highly skilled people who naturally command exceptional treatment.

This article uses elementary school math word problems as its subject matter. Some specialized knowledge is required, such as solving problems without using equations, but anyone who studied a STEM field at university can handle it. In general, though, deep expertise in the specialized field is required.

Below are three examples of prompt engineering that require specialized knowledge, starting with the simplest:

ChatGPT + GPT-4 vs. Bing

Here we analyze solving arithmetic word problems with the commercial ChatGPT based on GPT-4 and with Microsoft Bing. Statistical analysis will be conducted in the fine-tuning implementation.
Note that, judging from a comparison with a base large-scale language model without fine-tuning in OpenAI's Playground, the GPT-4-based ChatGPT (and the GPT-3.5-based one as well) has clearly been fine-tuned for arithmetic word problems.
Bing gives similar answers, so it appears to incorporate fine-tuning or even more elaborate techniques.

As of July 2023, ChatGPT + GPT-4 and Bing each have strengths and weaknesses and are roughly on par with each other.
For explanations aimed at elementary school students, more than half of the problems with one unknown are handled successfully.
For problems with two unknowns, providing an explanation aimed at elementary school students, with or without equations, is often beyond even human teachers.
Bing can provide explanations approaching a human teacher's level for some of the easier problems.

ChatGPT+GPT-4 is correct / Bing is incorrect.
ChatGPT+GPT-4 solves without equations, even without specific instructions.
Problem: Train A, 200 m long, travels at 20 m/s; train B, 150 m long, travels at 15 m/s in the same direction. How many seconds does it take from the moment train A catches up with train B until it has completely passed it?
ChatGPT+GPT-4: 350 m ÷ 5 m/s = 70 seconds.
Bing (conclusion): ...Therefore, the answer is 45 seconds.
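The passing calculation can be verified directly, confirming ChatGPT+GPT-4's 70 seconds and ruling out Bing's 45:

```python
# Train A (200 m at 20 m/s) overtakes train B (150 m at 15 m/s), same direction.
gap_to_close = 200 + 150             # A must gain both train lengths: 350 m
closing_speed = 20 - 15              # relative speed: 5 m/s
print(gap_to_close / closing_speed)  # → 70.0 seconds
```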

ChatGPT+GPT-4 and Bing are both correct.
However, neither provides an explanation suitable for elementary school students: ChatGPT+GPT-4 pushes through the calculation by force, while Bing jumps straight to the solution without any explanation.
Problem: Currently, Shinnosuke is 5 years old and Misae is 29 years old. In how many years will Misae's age be three times Shinnosuke's?

ChatGPT+GPT-4:
Misae's age (29 + x) = Shinnosuke's age (5 + x) * 3
29 + x = 3 * (5 + x)
29 + x = 15 + 3x
29 - 15 = 3x - x
14 = 2x
x = 7

Bing:
29+x=3(5+x) 29+x=15+3x 2x=14 x=7



ChatGPT+GPT-4 is incorrect / Bing is correct.
ChatGPT+GPT-4 does not use equations but is incorrect; Bing uses equations and is correct.

ChatGPT+GPT-4:
It runs 60 km ÷ 60 minutes = 1 km per minute.
It runs 240 km ÷ 60 minutes = 4 km per minute.
It overtakes before running half the course (30 km ÷ 2 = 15 km).
It takes 30 km ÷ 4 km/minute = 7.5 minutes.

Bing:
The distance the Ferrari travels is 4x km, and the distance the Prius travels is x km.
That is, 4x = x + 30, so 3x = 30 and x = 10.

ChatGPT+GPT-4: improved from incorrect to correct / Bing: correct.
For elementary school explanations, ChatGPT+GPT-4 is poor, but Bing is perfect.
GPT-4 (wrong on 6/30, improved by 7/10) still uses equations even when told not to. Bing is correct, and its explanation without equations is also perfect.
/* Incorrect as of 2023/06/30 */
x + y = 20 (1)

30x + 10y = 460 (2)

Factoring 10 out of equation (2):
10(x + y) = 460
x + y = 46 (3)
However, equations (1) and (3) contradict each other.

/* Correct as of 2023/07/10! */

Equation 1: x + y = 20

Equation 2: 30x + 10y = 460
Simplifying equation 2 gives 3x + y = 46.
3x + y - (x + y) = 46 - 20
2x = 26
x = 26 / 2
Therefore, x = 13.



Bing:
x + y = 20 (1) 30x + 10y = 460 (2)

From equation (1), y = 20 - x. Substituting into equation (2):
30x + 10(20 - x) = 460 30x + 200 - 10x = 460 20x = 260 x = 13

Therefore, 260 points / 20 points = 13 times.
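The solution x = 13 can be checked against both equations (the problem appears to involve 20 questions worth 30 or 10 points each, 460 points in total):

```python
x = 13                  # number of 30-point questions
y = 20 - x              # remaining 10-point questions (equation 1: x + y = 20)
print(30 * x + 10 * y)  # equation 2 check → 460
```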


Sample Project for Fine-Tuning

We perform fine-tuning for arithmetic word problems using OpenAI's large-scale language model API.
As of July 2023, fine-tuning of the GPT-4 model is not available via the API (GPT-4 can only be used in chat), so fine-tuning is done with text-davinci-003, a GPT-3.5-based model.
As explained earlier, fine-tuning is a technique that improves output accuracy by retraining only part of the parameters of the neural network making up the large-scale language model, using pairs of a domain-specific prompt and the ideal answer text.

In the "Chain of Thought" paper, arithmetic word problems involving the four basic operations were fine-tuned with a teaching dataset of 10,000 problems.
For inferring causality from correlation, fine-tuning used 400,000 problems ("Can Large Language Models Infer 'True' Causality from Correlation? Researchers Investigate with 400,000 Problems").
Since this is a prototype, we create 100 teaching examples for just 1 of the 30 patterns of word problems and test with them.
The sample project focuses on word problems involving "passing calculations".
"Passing calculations" are problems in which vehicles such as trains, cars, and ships pass a certain point or structure, cross each other, or overtake each other, requiring calculations of speed, distance, and time.
For example, the following is an example problem (an original problem).
If you prompt ChatGPT or Bing with such a problem, they set up and compute the equations even without instructions, so accuracy is reasonable.
The main purpose of this project is to generate explanations that do not use equations, aimed at elementary school students, by adding the prompt "Please explain without using equations, in a way an elementary school student can understand."
However, even with these instructions, the models sometimes keep using equations, give explanations too complex for elementary school students, or answer incorrectly.

Therefore, the goal of this project is to improve the accuracy of generating explanations for elementary school students by fine-tuning the models. For the example problem mentioned earlier, we will use the following prompt and target text pair:
答え 5000m
The problem starts from "prompt" to "tunetext". The text after "tunetext" is the answer key text.

We will prepare 100 such pairs and convert them into a JSON file, then pass 90 of them to the ChatGPT fine-tuning API to produce an improved large-scale language model.
Most of the training data is drawn from actual junior high school entrance exam questions, so it cannot be released publicly for copyright reasons. Using the fine-tuned model, we will then score the remaining 10 questions (original questions free of copyright issues), comparing the correctness of the answers and explanations between the model with and without fine-tuning.
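As a minimal sketch of the data-preparation step (the field names and placeholder strings here are mine, not the actual dataset, which cannot be published), each prompt/target pair can be serialized into the JSONL format that OpenAI's chat fine-tuning endpoint accepts:

```python
import json

# The 100 real pairs come from entrance-exam problems and cannot be
# published, so this single pair uses placeholder text.
pairs = [
    {
        "prompt": "(word problem text) Please explain without using "
                  "equations, specifically for elementary school students.",
        "target": "(equation-free explanation) Answer: 5000m",
    },
]

# The chat fine-tuning API expects JSONL: one JSON object per line, each
# holding the user turn (the prompt) and the desired assistant turn (the
# target explanation).
with open("passing_calculation_train.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        record = {
            "messages": [
                {"role": "user", "content": pair["prompt"]},
                {"role": "assistant", "content": pair["target"]},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The resulting file is what gets uploaded to the fine-tuning API; the 90/10 train/validation split described above would simply be two such files.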

We will submit the report on October 16, 2023.

Limitations of Prompt Engineering (Fine-tuning)

One significant research question regarding large language models is whether they can infer causal relationships; a notable study on this topic is discussed in the article "Can large language models estimate the 'true' causal relationship from correlation? Researchers investigate with 400,000 problems". Large language models excel at identifying correlations, since they are trained on "fill-in-the-blank" problems. However, correlation must be distinguished from causation. For example, in a population that includes both adults and children, height and measured intelligence are correlated, but being tall does not cause higher intelligence; age is the common factor behind both.
In the aforementioned study, fine-tuning on 400,000 texts about causal relationships improved the model's ability to estimate causation to some extent, but the accuracy rate remained at 30-40%. It became evident that textual information alone, without real-life episodic memories or sensory and emotional experiences, is insufficient (the symbol grounding problem).

Furthermore, a fundamental issue with AI built on large language models is that they probabilistically estimate the text that fills a gap from its surrounding context. Scaling up improves the accuracy of this estimation and allows larger gaps to be bridged, but it is questionable whether complex inference can reach practically reliable accuracy through purely probabilistic estimation of the text in the gap.

Method to Achieve 90% or Above Accuracy: Combination with Existing Programs

Considering only the tasks of "text comprehension" and "text generation," large-scale language models now surpass human reading and writing ability. One method, therefore, is to combine these comprehension and generation capabilities with a "separate program" such as a computer algebra system, an expert system, a database search, or business software. Note that the "separate program" should be a domain-specific technique.
A famous existing example is the combination of the computer algebra system Wolfram|Alpha with ChatGPT (for details, see "What Is ChatGPT Doing ... and Why Does It Work?" by Stephen Wolfram, the developer of Wolfram|Alpha, published in 2023). ChatGPT converts ambiguous user input into the formal Wolfram Language; Wolfram|Alpha's precise symbolic engine then processes it with essentially 100% accuracy; finally, ChatGPT presents Wolfram|Alpha's output in user-friendly natural language.
The combination with existing programs proceeds in three steps: (1) the language model converts the user's ambiguous natural-language input into a formal representation; (2) the separate program processes that representation deterministically; (3) the language model turns the program's output back into natural language.
You may wonder what happens when input arrives that the "separate program" cannot handle. Details are given in the sample project on combining with existing programs.
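The flow described above (formalize with the LLM, compute with the external program, render with the LLM) can be sketched as follows. Everything here is a toy stand-in, not a real API integration: the two "LLM" functions are lookup stubs where real code would call a chat-completion API, and the external program is a tiny exact fraction calculator standing in for something like Wolfram|Alpha or a Prolog solver.

```python
from fractions import Fraction
from typing import Optional


def llm_to_formal(question: str) -> Optional[str]:
    """Step 1 stand-in: the LLM turns loose natural language into a formal
    expression the external program understands. Returns None when the
    question falls outside the program's domain."""
    table = {"What is one third plus one sixth?": "1/3 + 1/6"}
    return table.get(question)


def external_program(expr: str) -> Fraction:
    """Step 2: deterministic, exact computation (no probabilistic guessing)."""
    left, right = expr.split("+")
    return Fraction(left.strip()) + Fraction(right.strip())


def llm_to_answer(result: Fraction) -> str:
    """Step 3 stand-in: the LLM phrases the exact result naturally."""
    return f"The answer is {result}."


def pipeline(question: str) -> str:
    formal = llm_to_formal(question)
    if formal is None:
        # Input the external program cannot handle: refuse rather than let
        # the model guess (one possible policy; see the sample project).
        return "Sorry, this question is outside what I can compute reliably."
    return llm_to_answer(external_program(formal))
```

The key design point is that the middle step never guesses: anything it returns is exact, and anything it cannot represent is refused up front.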

ChatGPT is considered impressive for its high reading comprehension and text generation abilities. Even arithmetic problems that require only the four basic operations can demand considerable reading comprehension.
With cleverly designed prompts, ChatGPT can solve problems whose meaning is hard to grasp, such as the following:
During the first 15 minutes of a traffic jam, several cars make it through,
and in the remaining 15 minutes another 20 cars make it through.
5 cars leave the line via the exit,
so if there were originally 30 cars on the highway,
how many cars made it through the jam in the first 15 minutes?
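Under one natural reading of this deliberately confusing problem (an assumption on my part: the 5 exiting cars and the 20 second-window cars are both subtracted from the original 30), the arithmetic itself is trivial; the difficulty is entirely in the comprehension:

```python
original_cars = 30   # cars on the highway at the start
exited = 5           # cars that left via the exit
second_window = 20   # cars that got through in the remaining 15 minutes

# Everything else must have gotten through in the first 15 minutes.
first_window = original_cars - exited - second_window
print(first_window)  # 5 under this reading
```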
The reading comprehension and text generation capabilities of large-scale language models such as ChatGPT are comparable to, or even surpass, those of American university students. According to some surveys, the latest models perform at a level higher than 90% of university students.
However, when it comes to reasoning/fact-checking/planning, even with prompt engineering, the accuracy drops to around 30% for models equivalent to ChatGPT and around 50% for models using GPT-4, even for elementary-level problems.
The aim of this article is to exploit the university-student-level language abilities of these models while delegating reasoning, fact-checking, and planning, where errors are unacceptable, to a separate program, so that ChatGPT's errors can be tolerated with confidence.

Utilizing the superior reading comprehension and text generation abilities of large-scale language models

"The application and dangers of the dialog-based AI 'ChatGPT' that can fluently create misinformation": With ChatGPT (GPT-3.5), inputting a roughly 600-word description of an emergency-room patient's symptoms yields a correct diagnosis about 50% of the time. This is too low for actual emergency-room use, but diagnostic accuracy well above 50% may be achievable in the future.

"The reason why AI before ChatGPT couldn't manipulate words like humans": "Rinna," the Japanese chatbot Microsoft released in 2015, can handle short question-and-answer exchanges but cannot follow long texts and quickly loses the thread of a conversation. ChatGPT surpasses Rinna in every respect, including handling long texts, tracking context, and generating natural text.

"What is the successor to ChatGPT, 'GPT-4,' that passes the American Bar Examination? Exploring its possibilities and risks": In MMLU, a benchmark of AI language-reasoning ability consisting of 14,000 multiple-choice questions across 57 categories spanning the natural sciences and humanities, GPT-3.5 scored 70.1% and GPT-4 scored 85.5% on the English test; scores in Japanese are about 5 points lower.

"'90% of Japanese university students can't compete with ChatGPT in terms of writing skills,' the concerns of a columnist teaching at a university": ChatGPT's essays are good enough that a grader might mistakenly award them a "良" (good). Grades rank, from the top, "優" (excellent), "良" (good), "可" (acceptable), and "不可" (failing).

Sample Project with Integration of Existing Programs

We replace the text-analysis and solution-generation parts of "Solving Math Word Problems and Providing Solutions for Elementary School Students," a program previously written in Prolog, with ChatGPT.
The development requires the following three tasks:
  1. Fine-tuning ChatGPT to convert the input prompt = math word problems into an intermediate representation for the math word problem solver in Prolog.
  2. Improving the Prolog solver to fit the input/output requirements of ChatGPT.
  3. Fine-tuning ChatGPT to convert the intermediate representation of the math word problem output to a text presented to the user.
Once the system is ready, the program will be made public.

The structure of the Prolog program, the math word problem solution program using Prolog, is simplified below, with an actual math word problem as the running example:
  1. Math word problem. Example: "If a total of 8 animals, consisting of cranes and tortoises, have a total of 26 legs, how many cranes and how many tortoises are there? Note that cranes have 2 legs and tortoises have 4 legs."
  2. Bottom-up chart parser
  3. Input intermediate representation (almost the same as in the system combined with ChatGPT)
  4. Math word problem solver using Prolog
  5. Output intermediate representation (almost the same as in the system combined with ChatGPT)
  6. Explanation text and diagram generation
Please refer to "Developing a chatbot for solving arithmetic word problems using Prolog" for more details. The system combining rule-based programming and ChatGPT developed this time can be simplified as follows:
ChatGPT + Prolog arithmetic word problem solver:
  1. Arithmetic word problem. Example: "Cranes and tortoises total 8 animals and have a total of 26 legs. How many cranes and how many tortoises are there? A crane has 2 legs and a tortoise has 4 legs."
  2. ChatGPT (fine-tuned to output the intermediate representation)
  3. Input intermediate representation:
     top_attr(sum,26,"number of legs","legs").
  4. Prolog-based arithmetic word problem solver
  5. Output intermediate representation:
     generate([When replacing all with cranes, the number of legs becomes 16.]).
     generate([When 1 animal becomes a tortoise, the number of legs increases by 2.]).
     generate([When 5 animals become tortoises, the number of legs increases by 10 and becomes 26.]).
     generate([The answer is 3 cranes and 5 tortoises.]).
  6. ChatGPT (fine-tuned to generate explanatory text from the output intermediate representation)
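The solver step in the middle can be sketched in Python (the actual solver is written in Prolog; this toy version implements only the crane-and-tortoise pattern and produces explanation steps analogous to the generate() lines of the intermediate representation):

```python
def solve_crane_tortoise(animals: int, legs: int,
                         crane_legs: int = 2, tortoise_legs: int = 4):
    """Crane-and-tortoise (tsurukame) method: assume every animal is a
    crane, then count how many swaps to tortoises are needed to reach the
    given leg total, recording explanation steps along the way."""
    steps = []
    all_cranes = animals * crane_legs
    steps.append(f"When replacing all with cranes, "
                 f"the number of legs becomes {all_cranes}.")
    per_swap = tortoise_legs - crane_legs
    steps.append(f"When 1 animal becomes a tortoise, "
                 f"the number of legs increases by {per_swap}.")
    tortoises = (legs - all_cranes) // per_swap
    steps.append(f"When {tortoises} animals become tortoises, the number of "
                 f"legs increases by {tortoises * per_swap} and becomes {legs}.")
    cranes = animals - tortoises
    steps.append(f"The answer is {cranes} cranes and {tortoises} tortoises.")
    return cranes, tortoises, steps
```

For the example above, solve_crane_tortoise(8, 26) yields 3 cranes and 5 tortoises, with the four explanation steps corresponding to the four generate() lines.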
The intermediate representation is likely to be changed in the future. If it is changed, it will be updated in this document.
All the practice problems are actual entrance exam questions for junior high school, so they cannot be publicly shared due to copyright issues. Only the validation problems (10% of the total) are original problems and can be shared publicly.