A Dialogue with Generative AI Pioneer Schmidhuber: Reflections After Not Receiving the Turing Award

Truth is like sunlight; although it may be temporarily obscured by dark clouds, it will eventually break through the gloom and shine brightly.

LSTM was considered "the most commercially valuable AI achievement" before the advent of ChatGPT.

However, Schmidhuber wants people to know more about the years 1990-1991, which he compares to the "miracle year" in physics (1905). According to him, during that period, he laid the foundation for "generative artificial intelligence" by introducing GANs (Generative Adversarial Networks), non-normalized linear Transformers, and self-supervised pre-training principles. This had a broad impact on the "G," "P," and "T" in ChatGPT.

Therefore, even before the deep learning trio (Geoffrey Hinton, Yoshua Bengio, and Yann LeCun) won the Turing Award, Schmidhuber was already dubbed the "father of mature artificial intelligence" by The New York Times. Elon Musk also praised him on X, saying: "Schmidhuber invented everything."

In 2013, Schmidhuber was awarded the "Helmholtz Award" by the International Neural Network Society (INNS) to recognize his significant contributions to machine learning. In 2016, he was awarded the IEEE Neural Network Pioneer Award. He currently serves as the Scientific Director of IDSIA, an AI lab in Switzerland, and as the head of the AI program at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. He is also involved in the operations of several AI companies.

This raises a new question: why hasn't he won a Turing Award yet?

Professor Zhou Zhihua, Dean of the School of Artificial Intelligence at Nanjing University, offers a noteworthy perspective: "In terms of contributions to deep learning, Hinton undoubtedly ranks first, with LeCun and Schmidhuber both making significant contributions. But HLB [Hinton, LeCun, and Bengio] are always bundled together. Winning awards requires nominations and votes, and personal relationships are also important. However, it doesn't matter; with a textbook-level contribution like LSTM, he can remain calm."

During the two-day in-depth conversation with "Jiazi Guangnian," Schmidhuber, with his signature stylish black beret and fluent German-accented English, presented himself as a scholar with both humor and approachability. However, beneath this amiable exterior lies an indomitable spirit, eager to establish scientific integrity in the rapidly developing field of AI research.

When discussing the overlooked contributions of himself and his academic colleagues, especially the groundbreaking achievements that small European academic labs made before the tech giants did, Schmidhuber's words reveal an urgency to correct the historical record.

Over the past few years, he has engaged in multiple public debates with LeCun, Ian Goodfellow, and others on social media and at speaking events, using well-prepared and peer-reviewed arguments to accuse others of "reheating" his earlier published work, arguing that the recognition due to early pioneers in the deep learning field should not be diminished.

His outspokenness naturally leads to controversy about his personality. However, Schmidhuber's perspective, rooted in Europe and academia, indeed provides the public with valuable diverse viewpoints beyond the potentially misleading mainstream narratives from Silicon Valley. Moreover, he not only persists in speaking for himself but also tirelessly commends his outstanding students and those underestimated contributors in the development of AI, striving to give them due credit.

Regarding the debate over who should be called the "father of artificial intelligence," Schmidhuber points out that building AI requires an entire civilization, and that the concept of modern AI had already emerged, driven by mathematical and algorithmic principles, decades or even centuries before the term "artificial intelligence" was coined in the 1950s.

As for negative comments directed at him personally, Schmidhuber appears more nonchalant. He often quotes famous singer Elvis Presley: "Truth is like the sun. You can shut it out for a time, but it ain't goin' away."

In this article, "Jiazi Guangnian" interviews Jürgen Schmidhuber, discussing the origins of artificial intelligence long before 1956, his own research and views on the "three giants of deep learning," and looking to the future. He believes that a machine civilization capable of self-replication and self-improvement may emerge. On the path to AGI, he believes that in addition to large companies, someone without much funding can also bring comprehensive innovation to AI research.

1. A Better Architecture Than Transformer

Jiazi Guangnian: Let's start with the history of artificial intelligence. You have a deep understanding of AI development. What aspects of AI history do you think need clarification?

Schmidhuber: There are certainly many. The beginning of artificial intelligence was much earlier than the Dartmouth Conference in 1956, when the term "artificial intelligence" first appeared. In fact, as early as 1914, Leonardo Torres y Quevedo had already designed an automated device capable of playing chess. At that time, chess was considered the exclusive domain of intelligent beings. As for the theory of artificial intelligence, it can be traced back to Kurt Gödel's work from 1931-1934, when he established the fundamental limitations of AI computation.

Some people say that artificial neural networks are a new thing that emerged in the 1950s, but that's not true. The seeds of the idea were planted more than 200 years ago. Gauss and Legendre, two genius teenagers, proposed concepts around 1800 that we now recognize as linear neural networks, although at the time the technique was called the "method of least squares." They had training data consisting of inputs and desired outputs, and adjusted weights to minimize the error on the training set in order to generalize to unseen test data, which is essentially what a linear neural network does.
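As an illustration of the least squares idea described above, here is a minimal sketch in Python. The data points and the function names `fit_least_squares` and `predict` are hypothetical and chosen only for this example; the sketch fits a single weight and bias in closed form, then applies them to an input that was not in the training set.

```python
def fit_least_squares(xs, ys):
    """Closed-form least squares fit of y = w * x + b (one input, one output)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx              # weight that minimizes the squared training error
    b = mean_y - w * mean_x    # bias that goes with that weight
    return w, b


def predict(w, b, x):
    """Output of a linear unit with weight w and bias b."""
    return w * x + b


# Hypothetical training data: inputs and desired outputs, roughly y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.1, 2.9, 5.2, 6.8, 9.0]

w, b = fit_least_squares(train_x, train_y)

# Generalize to an input that was not part of the training set.
unseen_prediction = predict(w, b, 10.0)
print(w, b, unseen_prediction)
```

The fitted weight and bias land close to the underlying slope and intercept, which is the sense in which the method "generalizes" from training data to unseen inputs.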

This is what we now call "shallow learning," so some people think that more powerful and novel "deep learning" is an innovation of the 21st century. But that's not the case. In 1965, in Ukraine, Alexey Ivakhnenko and Valentin Lapa pioneered the first learnable deep multi-layer network. For example, Ivakhnenko's 1970 paper detailed an eight-layer deep learning network. Unfortunately, when others later republished the same ideas and concepts, they didn't cite the Ukrainian inventors. There are many cases of intentional or unintentional plagiarism in our field.

Jiazi Guangnian: You yourself have played an important role in the history of artificial intelligence. Can you tell us about that miraculous year of 1991? What contributions did your research make to the AI industry at that time?

Schmidhuber: 1990 to 1991 was our time of creating miracles, which I'm very proud of. In just one year, we nurtured many core ideas that support today's generative AI in our lab at the Technical University of Munich.

Let's start with ChatGPT. The GPT in its name stands for Generative Pre-trained Transformer. First, let's talk about the G in GPT and generative AI. Its roots can be traced back to the concept of generative adversarial networks that I first proposed in 1990. At that time, I called it "artificial curiosity," where two neural networks playing against each other (a generator with adaptive probabilistic units and a predictor influenced by the generator's output) use gradient descent to maximize each other's losses in the game. It is a minimax game: the generator tries to maximize what the predictor is trying to minimize. In other words, it's trying to "fool" the opponent by generating unpredictable content to challenge the predictor's limits. This technology was later widely used to create deepfakes.

As for P, the "pre-training" part of GPT, I also published about this in 1991. I found that unsupervised or self-supervised pre-training can greatly compress sequences, thus facilitating downstream deep learning of long sequences (such as very long texts).

T stands for Transformer. Some people think it was born at Google in 2017, but in fact, I had already introduced a variant of this concept in 1991, called the "fast weight controller," one variant of which is now known as the "non-normalized linear Transformer." This early Transformer was extremely efficient, requiring only 100 times the computation for 100 times the input, rather than 10,000 times like current Transformers.
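To make the scaling claim concrete, the following is a rough sketch of unnormalized linear attention written as a running "fast weight" matrix. It is an illustration under stated assumptions, not a reproduction of the 1991 system: the keys, values, and queries are given directly as toy lists (in a real model a slower network would produce them), and the function names are hypothetical. The first function does one fixed-size update per input, so 100 times the input costs about 100 times the work; the second computes the same outputs through explicit pairwise scores without softmax, which grows quadratically with sequence length.

```python
def fast_weight_attention(keys, values, queries):
    """Linear-time version: accumulate outer products v k^T, then read out with q."""
    d_k, d_v = len(keys[0]), len(values[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # fast weight matrix, starts at zero
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # Additive outer-product update: W += v k^T (cost independent of position).
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += v[i] * k[j]
        # Read-out: y = W q.
        outputs.append([sum(W[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outputs


def pairwise_attention(keys, values, queries):
    """Quadratic-time version: every query is scored against every earlier key."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(keys[s], q))  # dot product, no softmax
            for i in range(len(out)):
                out[i] += score * values[s][i]
        outputs.append(out)
    return outputs


# Toy inputs chosen only for illustration.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
queries = [[1.0, 1.0], [1.0, 0.0], [0.0, 2.0]]

linear_out = fast_weight_attention(keys, values, queries)
quadratic_out = pairwise_attention(keys, values, queries)
print(linear_out)
print(quadratic_out)
```

Both functions return the same outputs because W q at step t equals the sum over earlier steps s of v_s times the dot product of k_s and q. Dropping the softmax normalization is what allows the sum to be carried forward in a single matrix.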

Jiazi Guangnian: Many people, including the authors of Transformer, have stated that we need a better architecture than Transformer. It's certainly not perfect, so what do you think the next generation architecture should look like?

Schmidhuber: Now, improving Transformer efficiency is a hot topic, and my 1991 design is undoubtedly an excellent starting point.

For discussions about the next generation of LLMs, we can go back to the initial stage. At that time, both Google and Facebook were using our Long Short-Term Memory networks, or LSTM Recurrent Neural Networks (RNNs), which can be traced back to the 1991 thesis of my outstanding student Sepp Hochreiter. This thesis not only described experiments with the aforementioned pre-training (the P in GPT) but also introduced residual connections, which are core components of LSTM, allowing for very deep learning and processing of very long sequences. I proposed the name LSTM in 1995, but the name isn't important; what's important is the mathematics behind it. It wasn't until the late 2010s that LSTM was replaced by Transformer, because Transformer is easier to parallelize, which is key to benefiting from today's massively parallel neural network hardware (like NVIDIA's GPUs).

Jiazi Guangnian: Can RNNs solve tasks that Transformers can't?

Schmidhuber: In principle, RNNs should be more powerful. Take parity checking, for example: given a bit string like 01100, 101, or 1000010101110, is the number of 1s odd or even? It looks like a simple task, but Transformers fail to generalize on it. However, even simple RNNs can solve this task.
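The parity task shows why recurrence helps: a single bit of state, updated once per input symbol, is enough for strings of any length. The sketch below hand-codes that recurrence rather than training a network, so it only illustrates the kind of solution an RNN can represent; the function name `parity_recurrent` is hypothetical.

```python
def parity_recurrent(bits):
    """Carry one bit of hidden state across the string and flip it on every '1'."""
    state = 0  # 0 means an even number of 1s seen so far
    for ch in bits:
        state ^= 1 if ch == "1" else 0  # recurrent update: new state = old state XOR input
    return "odd" if state else "even"


# The three strings from the interview, plus one with a single 1 for contrast.
for s in ["01100", "101", "1000010101110", "1"]:
    print(s, parity_recurrent(s))
```

The update rule is the same at every position, so it works unchanged on strings far longer than any it was checked on.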

Recently, Hochreiter's team developed an impressive LSTM extension called xLSTM, which has linear scalability and outperforms Transformers in various language benchmarks. Its superior understanding of text semantics