

How did garbage come in, but quality come out in training ChatGPT?

2023-02-27

AI

For those who may not be familiar with ChatGPT, it's basically a question-answerer. It can give answers in English, Python, or other forms. It's still new, and it will take more time to fully appreciate its real worth, but I have already observed many promising results from it. The question for me was, how on earth did the machine selectively learn good writing and coding practices from the huge corpus of junk text that we call the internet?

Even as a firm believer in deep learning's disruptive potential, I had my doubts about AI becoming an exemplary coder in the near future. While the internet has a lot of high-quality code, it has far more mediocre code written by second-rate programmers. I thought, if AI learned from the entire corpus of internet text, it could become eloquent, but the code it produced would also be filled with many bad practices. And I couldn't have been more wrong. At least in terms of coding style and clarity, ChatGPT produces good code, not mediocre code. How was this possible?

There is this famous Tolstoy quote: "All happy families are alike, but every unhappy family is unhappy in its own way." I think the same can be said for many other things. There are many ways to speak correct English, but there are far more ways to speak broken English. Take dancing, too. It's actually much harder to replicate amateur moves than professional ones, since amateur moves come with so many different kinds of subtle quirks. And coding, in fact, is no different.

Learning something is, by and large, about finding patterns. It is no exaggeration to say that nothing can be learned if you cannot recognize any patterns in what you are trying to learn. I think it's actually much easier to learn good coding practices even when the internet is filled with more bad practices, since good code is full of simple and consistent patterns while bad code is bad in many different, inconsistent ways. This is how garbage went in, but quality came out in training ChatGPT.
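To make the intuition a bit more concrete, here is a toy sketch in Python. It is not a model of how ChatGPT is actually trained; it only illustrates the counting argument. The function name `build_corpus`, the 30% share of good code, and the 200 bad variants are all made-up assumptions for illustration.

```python
import random
from collections import Counter


def build_corpus(n_samples=10_000, good_share=0.3, n_bad_variants=200, seed=0):
    """Build a toy corpus of coding 'patterns' (all numbers are assumptions).

    Good code always follows the same single pattern, while bad code is
    spread across many different variants, each bad in its own way.
    """
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_samples):
        if rng.random() < good_share:
            corpus.append("good")
        else:
            corpus.append(f"bad_{rng.randrange(n_bad_variants)}")
    return corpus


corpus = build_corpus()
counts = Counter(corpus)

# Total number of bad samples, summed over every bad variant.
bad_total = sum(count for pattern, count in counts.items() if pattern != "good")

# The single most frequent pattern in the whole corpus.
most_common_pattern, most_common_count = counts.most_common(1)[0]

print(f"good samples: {counts['good']}, bad samples: {bad_total}")
print(f"most common pattern: {most_common_pattern} ({most_common_count} samples)")
```

Under these assumptions, bad samples outnumber good ones overall, yet the one consistent good pattern is still far more frequent than any individual bad variant. A learner that picks up on the strongest recurring pattern would land on the good one.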


ChatGPT e.g. 1

[Image]
[Image]


ChatGPT e.g. 2

[Image]
[Image]


