In the fifth installment of Speaking of AI, LXT’s Phil Hall chats with Roberto Pieraccini, Chief Scientist at Uniphore, about the evolution of AI, speech recognition and more. From comparing Siri to 2001: A Space Odyssey’s HAL to discussing unsupervised learning in AI, Phil and Roberto provide valuable insights into the history and scalability of generative AI.
Highlights:
- Introducing speech recognition expert and industry veteran Roberto Pieraccini
- Was Siri what scientists of the field hoped for to recover from all the frustration and unpopularity that speech recognition’s small and big earlier failures had raised in the popular culture?
- Did Siri turn out to be the gentler version of HAL from 2001: A Space Odyssey that we’ve been waiting for, for more than 60 years?
- How concerned are you about the villain potential of generative AI?
- How would you characterize the contemporary value of AI systems, perhaps from a user perspective and then from a commercial perspective?
- In terms of advances in machine learning and AI, does it still boil down to solving communication problems with elegant mathematical solutions?
Introducing speech recognition expert and industry veteran Roberto Pieraccini
PHIL:
Hello and welcome. My guest today is an electrical engineer, a speech scientist, and an industry veteran, a true industry veteran. When I read his list of employers, it’s like laying out a roadmap for the contemporary history of speech recognition and AI.
Bell Labs, AT&T, SpeechWorks, ScanSoft, IBM, SpeechCycle, Jibo, Google. And in addition to his current job as chief scientist at Uniphore, he’s a talented and successful author, photographer and musician. It gives me enormous pleasure to welcome as our guest today, Roberto Pieraccini. Hello and welcome Roberto.
ROBERTO:
Thanks so much, Phil. I know we’ve been talking for a while about this podcast, and I’m glad we made it happen, and thanks for inviting me, it’s a pleasure.
Was Siri what scientists of the field hoped for to recover from all the frustration and unpopularity that speech recognition’s small and big earlier failures had raised in the popular culture?
PHIL:
It’s great to have you here. So, Roberto, you’ve published two wonderfully informative and accessible books on the evolution of AI, 2012’s The Voice in the Machine and 2021’s AI Assistants.
I don’t imagine that at the time of their publication, you could have foreseen that they would roughly coincide with two of the biggest landmarks in the technology’s progress. That is, in 2011, Siri’s full integration into the iPhone, which really did change things for speech recognition, I believe. And then in 2022, the public release of ChatGPT.
So, looking at Siri and its integration into the iPhone, when it was already apparent that Siri would be impactful and relevant, you asked some great questions, but ultimately left them unanswered. So, I’d love to hear how you would respond to your own questions today. From your perspective, was Siri a beginning or an end? And was Siri what the scientists of the field hoped for to recover from all the frustration and unpopularity that speech recognition’s small and big earlier failures had raised in the popular culture?
ROBERTO:
Thanks Phil, that’s a great question. As a matter of fact, when Siri came out, I think it was October 4, 2011, exactly one day before, unfortunately, Steve Jobs passed away, I was in a hotel in Melbourne, in your country. At the time I was at SpeechCycle, and we were working with Telstra, the large Australian telecommunications company. And I already knew Siri, because Siri had come out as an independent app on the iPhone a year or two earlier.
I had tried it, and I was mesmerized by Siri. Then I heard that it was coming out soon on the iPhone 4S, if I’m correct. And the proofs of my book were already at the publisher, MIT Press; I had already signed off on the final manuscript. I said, I cannot publish a book without a chapter on Siri. So I remember, in that hotel room, I very quickly wrote the last chapter, which is called Siri, What’s the Meaning of Life? The publisher liked it and we inserted it.
So, was Siri a beginning or an end? It was both, right? Siri was the end of decades of frustration with speech recognition, for many reasons. The technology was not quite there. I started working in speech recognition in 1980, but speech recognition started earlier than that, as you know. We kept moving ahead, and all the while people were making fun of it, saying speech recognition isn’t working. The problem was that the world knew about speech recognition through what is called IVR, interactive voice response applications, which somehow simulate contact center agents. And there were a lot of issues at the time. First of all, people called agents when something was wrong.
So, they were not in the best of moods, right? A check bounces, the appliance you bought doesn’t work, your TV doesn’t work. And then, instead of talking to a human, they talk to a machine, and the machine did not always recognize them, and so on. So Siri, I’d say, was the end of that time, although that time continues today in some ways, right? We’ll probably talk a little bit about that later. But Siri opened up the technology of speech recognition to the whole world in a totally different manner. People were no longer captive in an application they didn’t want to talk to; people could choose freely to use Siri for many things. And people started using it in more unusual ways. “Where can I bury a body?”, I remember, was for a time one of the most popular questions to ask Siri, alongside all the things like setting an alarm and that type of thing. So that was a good thing, right? Now, the problem was that it was very hard to know the limits of the voice assistant. Can I ask anything? Can I ask anything in the way I want, in any possible way, with any possible accent, in any possible noise situation? The answer was no, and it was very hard to understand the limits, right? What I can ask and what I cannot ask.
And the second point: what are the capabilities? Can I ask it to plan my next vacation? Can I ask it to tell me the name of that nice restaurant in Paris where I went three years ago, no, maybe it was four years ago, with my wife, where I had onion soup, right? We laugh about that, but except for the onion soup it’s plausible: if you use Google Maps, Google Maps pretty much knows the restaurant you went to in Paris three or four years ago. So in principle you could answer this question, but not at that time. So it was the beginning of a new era, the era of automated assistants, or voice assistants, or, as I called them in my second book, AI assistants. And in fact, within a few years, Alexa came out, Google Assistant came out, and Samsung Bixby came out. So that’s why I consider it an epochal thing: it was a change of scenario for speech recognition.
Did Siri turn out to be the gentler version of HAL from 2001: A Space Odyssey that we’ve been waiting for, for more than 60 years?
PHIL:
Yeah, absolutely. It is quite remarkable today, with multiple choices. I’m sitting in my lounge room and I can actually call out any of those names, and they’re there and ready to help. So the big question: in your books you often made reference to HAL, the artificial assistant in 2001: A Space Odyssey, and in fact, in the end, the villain of 2001: A Space Odyssey. So the question you asked: in your opinion, did Siri turn out to be the gentler version of HAL that we’ve been waiting for, for more than 60 years?
ROBERTO:
Yeah, that’s a very good question. You know, when I was in high school, I went to see Kubrick’s 2001: A Space Odyssey. That was one of the things that really impressed me, along with the moon landing. I wanted to be an engineer. This is great, this is so cool, right? And then, by chance, I ended up in speech recognition. And I was always dreaming: can we build a gentle version of HAL, not a villain version, but a gentle version? And I would say that comparing Siri with HAL is a little bit of a stretch, right?
You know, if you watch 2001: A Space Odyssey again, you’ll see that HAL has incredible capabilities that even today we cannot match, even with the latest ChatGPT and large image models. There is a scene where Dave, the astronaut, is drawing something in a sketchbook. Life was very boring on the Discovery, because the other crew members were hibernating; there was no one to talk to, or very few people to talk to. So he was sketching the faces of his colleagues who were hibernating. Then HAL asks, what are you doing there? Dave says, I’m sketching, and shows it to HAL. HAL says, ah, nice rendering; you’ve improved a lot in the past couple of months. Can you hold it closer? And Dave puts it closer to the camera, and HAL says, ah, that’s Dr. Hunter, isn’t it? That’s amazing, right?
We cannot do that. And there’s also the level of emotion, like when they play chess. Of course, HAL cheats at a certain point. HAL is cheating, right? It’s calling one move but making another move, which you can see if you look carefully at the chessboard. But then HAL says, thank you for a very enjoyable game, which is amazing, right? Of course that could be canned, right? But the level of expression, the level of everything, it looks like, well, we are far from there, right?
Whether we’re going to get to something like HAL, I don’t know, but today the comparison is a little bit of a stretch. I don’t think we are quite there. And whether we will, or whether we even want to build something like that, is another question.
How concerned are you about the villain potential of generative AI?
PHIL:
Yeah, I mean, a lot has been written since the launch of ChatGPT, which really awakened public consciousness of AI’s potential, generative AI’s potential in particular. How concerned are you about the villain potential, if you like?
ROBERTO:
Yeah, of course, being deep into this technology, I’m concerned, but not concerned about, you know, the doomsday when AI will imprison us or put an end to the human race, or the human species. I don’t think so. I’m not afraid of that. What I’m afraid of is the use of AI by villains, by humans who are villains, right? For doing things that can cause a disaster.
Imagine interrupting the electrical power for days, or imagine someone getting into the banks and taking your money. These are real things that could happen.
It’s like, if you remember, when we had viruses, computer viruses I mean; we haven’t heard much about viruses other than COVID lately, right? It was a continuous fight between the virus detectors and all the viruses that run on your PC. Every time there was a new virus, the virus detectors came out with a new solution. And I think that’s our future, right? We need to create measures, guidelines, and guardrails: it could be legal guidelines and technology guardrails that prevent anything bad from happening. That’s why everyone, including myself, is talking about responsible AI. We build AI, so we need to be responsible for what we build. And all the companies in the world are really on the same line here. Of course, as in every case, it’s the good people who think that way. What we need to fear is the people who are not as good, who want to do evil with it.
How would you characterize the contemporary value of AI systems, perhaps from a user perspective and then from a commercial perspective?
PHIL:
Yeah, absolutely. Thanks for that. I think people will find that answer quite reassuring, which is a good thing. So let me bounce back to the beginning of the 1990s. At that point, the way I saw it anyway, the value proposition for speech technology was evident but not particularly compelling, up until the point where Jay Wilpon and AT&T released the first meaningful voice automation system at scale. In your book, you referred to this, and you said that a significant financial investment was needed to support the emerging technology so that it could fulfill its potential.
So it was hardly surprising if, at that time, you could measure success in terms of the number of humans that could be replaced, for example. You showed that a simple five-phrase application enabled six thousand layoffs. That makes a compelling financial case, which perhaps didn’t exist before that event. And then later in the 90s, I love the phrase, there was a shift from carbon-based transcription to silicon-based.
ROBERTO:
Did I say that?
PHIL:
Yeah, I think so. So it doesn’t really surprise me that at that time, when people were trying to make a compelling case, trying to get the financial support they needed, they would make their case very strongly in financial terms: these are the savings we can make.
So, give us the money and we’ll make those savings for you. How would you characterize the contemporary value of AI systems, perhaps from a user perspective and then from a commercial perspective?
ROBERTO:
Thanks. This is a very, very good question. First of all, you mentioned Jay, Jay Wilpon, who is a great friend of mine. We were born the same year, and we always love seeing each other; if he listens to this podcast, I say hello to Jay. As a matter of fact, AT&T assured me, or Jay assured me, that the 6,000 layoffs never happened; they found another way. I don’t know if that’s true or not, right? But I was reassured. Jay was working on this technology probably one or two years before I joined AT&T Bell Labs.
But it’s always the case, if you look at history. The most evident example is the invention of the assembly line by Ford in 1913, I think. That displaced a lot of people, right? And any new technology does that. The whole society and the whole system need to be aware of that and need to provide help for the people who are displaced. People need to be retrained for different jobs and different careers.
And so that’s a responsibility that we have as a society. But we cannot stop the progress of technology, right? Otherwise today we would still use horses, not Teslas, to go to work, right? So, talking about this particular technology, the reason it came out in the mid-90s was that, if you remember, before that there was what’s called DTMF: press 1 for this, press 2 for that. Then, in the mid-90s, thanks to SpeechWorks and Nuance, at least in the US, voice recognition became reasonably accurate, and we started using it. And the reason was not just to lay off people to save money. If you look at the spending necessary for an enterprise to maintain a trained task force for customer care, it is huge. And what happened is that the agents were often poorly trained, so they didn’t provide much value to the end customer. And since you don’t have an infinite number of agents responding to the demands of an increasing number of customers, customers end up waiting in a queue with music for tens of minutes, or half an hour or more. That’s not the best use of their time.
So the idea was to automate part of that, and to always provide a way to escalate to a human agent when the problem was too hard for the machine to solve. In a sense, it was not just about saving money, but also about providing better customer care. Today I am back in that space, after Google, after Jibo: I work at a company called Uniphore, where that is one of the many value propositions we provide. So, what we call self-serve, right? Building applications that somehow automate certain functions of the agent, not all of them. Actually, there are even more interesting applications than that where AI is helping. One is agent assist. I told you that it’s hard to train agents, and there’s a big turnover and attrition among agents.
Talking to an agent who is not trained doesn’t help. So AI can provide support to agents: tell them the next thing to do, give them answers from the knowledge base to what the user is asking. And also making summaries. Today we can do summaries quite well using the latest technology. A summary is something every agent has to produce to wrap up a call at the end, and it takes time, which means money for the company. So summaries can be generated automatically, for the agent to review and correct when needed. Other applications analyze the tens of thousands, hundreds of thousands of calls that a company gets every day. Big enterprises often don’t even know what their agents talk about, what the customers are asking about, what new trends and new problems are arising. If you can do that automatically, it’s a big, big improvement in the customer care situation for a company. And there are other applications we are working on, like support for sales: emotion detection in a sales call, to understand the sentiment of the clients and provide hints to the salesperson, for instance when the clients are not engaged, and so on.
So there are a lot of applications; it’s not just the horrible IVR-like experience people remember. If you recall, at the time there was a website, “I want a human” dot com, that just told you the tricks: for example, you call AT&T, you get an automated machine, push zero three times, say the magic words, and then an operator will come, right? And actually, it was a serious thing. It was a serious thing.
In terms of advances in machine learning and AI, does it still boil down to solving communication problems with elegant mathematical solutions?
PHIL:
Yeah, I can imagine. I think that’s a really good explanation of the value proposition commercially. And from a user perspective, these days people aren’t going to go and look at a site like that anymore, because they’re quite satisfied with what they get from the interactions, and perhaps would even choose it over a human in some situations: they don’t have to engage, they don’t have to do small talk, they can just move right along.
Okay, in 2012, you framed AI as language, thought, understanding, solving communication problems with elegant mathematical solutions. And I’m interested in that – the elegant mathematical solutions part of this. Is that still how you view it today? In terms of advances in machine learning and AI, does it actually still boil down to solving communication problems with elegant mathematical solutions? Or is it now something different?
ROBERTO:
Yeah, that’s a very, very good question that gets into epistemology and philosophy, and I could talk about it for hours. I would love to talk for hours with you in front of a good Australian red wine.
PHIL:
Yeah, we can arrange that.
ROBERTO:
We can do that next time. But that’s very interesting. I’ve been very fortunate to live through the evolution of AI, or machine learning as I like to call it, up until today. When I said that, I was referring in particular to speech recognition, or ASR as people call it. People started trying to find solutions to ASR, as far as we know, in the 1950s, using analog computers: not even digital, analog machines with resistors and capacitors and tubes.
But in the 1970s, with the work done at IBM Research by Fred Jelinek and Jim Baker, an elegant mathematical formulation of the problem of speech recognition came out. It’s the only equation that I have in my book, the equation that everyone who works in speech recognition should at least know. It frames solving the problem of speech recognition as solving the problem of the optimal receiver in the presence of noise. We have thoughts, we express our thoughts with words, the words go through a noisy channel (and the noise includes the different variations in the way we talk, different dialects and so on), and then they get to the ear of a listener. You can express that mathematically: you want to recover the best possible sequence of words given the noisy signal. And the equation highlights two important components that have been with us until, I would say, 10 years ago: the acoustic model and the language model. So we knew about language models way before large language models came along; we have known language models since the end of the 1940s, with Claude Shannon discussing them in his famous seminal paper on information theory.
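For readers who want to see it, this is a sketch of the textbook form of the equation Roberto is describing, with A standing for the acoustic observations and W for a candidate sequence of words:

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

The second form follows from Bayes’ rule (the constant P(A) can be dropped inside the argmax), and it makes explicit the two components he mentions: the acoustic model and the language model.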
So what happened today? If you look at what happened today, I would like to cite one of my philosopher heroes, Daniel Dennett, who in a recent book drew a parallel between evolution theory, Darwinism, and Alan Turing’s Turing machine. Evolution theory taught us that you can build complex, sophisticated organisms without comprehension. Evolution does not understand how a virus, a frog, or a human being works; it all happens through an algorithm, the survival of the fittest, with many variations of it, right? So Dennett talks about competence without comprehension. And Alan Turing showed that a very simple machine can solve any problem that can be computed. Of course, it’s a virtual, theoretical machine.
And today we have large language models. Large language models don’t have modules inside. They don’t, as in classic natural language understanding, take the string of words, find the adjectives, find the nouns, find the verbs, and make hypotheses about the structure; there’s nothing of that. It’s just a mass of artificial neurons that has been trained on so much text, and trained only to predict the next word. And it’s amazing how, just by predicting the next word, the next token, they show behavior that seems intelligent. I say seems because, unlike many people, I don’t believe it’s total intelligence, but it demonstrates some rationality and some intelligence, and it also makes a lot of mistakes sometimes. Right, and again, it’s competence without comprehension. The individual artificial neurons inside the machine don’t understand anything; they don’t have comprehension. But the mass of them, like the neurons in our brain, eventually gets to a level of competence.
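As a minimal illustration of the “predict the next token” objective Roberto describes, here is a toy sketch in Python. It uses a bigram count table rather than a neural network, which is emphatically not how LLMs are built, but the learning signal, predicting what comes next from what came before, is the same idea:

```python
from collections import Counter, defaultdict

# Toy next-token predictor: count which token follows which token in a
# corpus, then "predict" by picking the most frequent follower. Real LLMs
# replace this count table with billions of neural-network parameters and
# condition on a long context window, but the objective is the same:
# given the text so far, predict the next token.
corpus = "the cat sat on the mat the cat ate the fish".split()

follower_counts = defaultdict(Counter)
for prev_token, next_token in zip(corpus, corpus[1:]):
    follower_counts[prev_token][next_token] += 1

def predict_next(token: str) -> str:
    """Return the most frequently observed token after `token`."""
    followers = follower_counts.get(token)
    return followers.most_common(1)[0][0] if followers else "<unk>"

print(predict_next("the"))  # -> "cat" (follows "the" twice in the corpus)
print(predict_next("on"))   # -> "the"
```

Nothing in this table “understands” cats or mats; whatever competence it shows comes purely from the statistics of the text, which is exactly the competence-without-comprehension point.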
Is there an absence of design in generative AI?
PHIL:
Does that suggest that there’s an absence of design at this point as well?
ROBERTO:
So, I like to talk about ChatGPT and large language models, and large language models are not just ChatGPT: when I was at Google, I was working on Meena and LaMDA, whose evolution eventually led to Bard and Gemini. I like to describe this phase of the history of AI as the end of the intelligent designer. There is no intelligent design. That’s not totally true, of course: in order to build a ChatGPT, there is a lot of engineering behind it, right? And the same for everything I was talking about. But we don’t design modules. We don’t design algorithms. We don’t design things the way we used to 12, 13 years ago, right?
PHIL:
Right, so there is a very large phase, perhaps the largest phase, in which there is an absence of design.
ROBERTO:
Yes, yes. And we are at the beginning of that, of what I actually call the democratization of AI, right? Many people call it that, because everyone can use ChatGPT to create interesting applications, even if you have never studied natural language understanding, speech recognition, and so on.
Is unsupervised learning the holy grail for speech recognition?
PHIL:
That’s great. Which is a nice lead into the next question I have here. So, you noted that unsupervised learning is the holy grail for speech recognition. I think it’s safe to say that this is, today, perhaps the holy grail across the entirety of AI and machine learning. To what extent do you think it’s possible and practical to achieve this holy grail? Is there a limit that you think it might be impossible to cross?
ROBERTO:
I think we have to cross the limit. If we want to build machines that are more and more competent, we need to be able to train them with more and more data and more and more modalities, not just text. The limit of today’s large language models is that they have seen only text. Imagine a brain in a vat that has only read the web: it never touched a cold surface, it never tasted a lemon, and so on, right? So there is a limitation there. And we cannot use what we used to rely on before, annotations, right?
Annotated, curated data. We still need to curate the data in a way that avoids redundancy and duplication, and clean it of a lot of things, but that can be done programmatically. When I started in speech recognition in the 1980s, we had to record words and make sure they were the words that were tagged, and so on. We cannot do that anymore. Also, there are now stricter privacy guidelines: I cannot use recordings freely, right? So we need to think about how to do this differently. And you know, large language models are trained mostly in a self-supervised manner: they learn how to predict the next token. It’s easy to take text, mask it, remove the next token one position at a time, and ask the large language model to predict it.
So that’s one thing. The other thing is that we see more and more generation of synthetic data, which is helping; the same happened a few years ago with speech recognition. We can generate data: we know how to generate speech, and we can generate variations with noise and so on. So this is the new world, and these are the problems. Now we are also seeing a lot of use of the human in the loop, in what is called reinforcement learning with human feedback. That’s very useful, but it happens on a limited amount of data, and it’s useful for many things, like working on hallucinations, bias, and safety. So I still see a role for that, but the big majority of the data is going to be unsupervised.
Scalability issues, increase in volumes of data, human-like performance of generative AI, and human in the loop: Are there near term solutions for scalability concerns around these topics?
PHIL:
My next question does dig a little bit deeper into that. So over recent decades, the volumes of data used have increased massively. And you’ve expressed concerns about how this impacts scalability. So, I’ll just read a fairly lengthy quote here and then there’s a few questions.
“What plays a big role in the more difficult attainment of human-like performance in language, understanding and generation, is that even today we still need to rely on representations of meanings such as intents and arguments which are not naturally available to us and these need to be crafted on a case-by-case basis and crafting an abstract representation requires a lot of work that hardly scales to cover all possible meanings and their variations.” Now, I’m sure that at the time you wrote that, it made a lot of sense, and what’s happened in the time since then has probably changed the picture even further.
So, do you see a near-term solution to this scaling problem? Do you think the case-by-case crafting that you talked about can be automated? And if it is automated, does the issue of achieving human-like performance extend to some of the bigger issues, such as hallucination and the elimination or management of bias? And can this be achieved without a human in the loop?
ROBERTO:
That’s a great set of questions. Let me start by saying that, in my opinion, and of course all of this is my opinion, anytime we tried to impose a human-invented representation on a machine, on a language machine or a speech machine, we didn’t get great results. A clear example is speech recognition. Until, I would say, 12 or 15 years ago, we imposed the phonetic transcription, and we couldn’t get past a certain accuracy. Why? Because, to a certain extent, and I know I will attract the irate responses of linguists, phonemes are a human invention. If you stop someone in the street and ask, can you tell me what the phonemes are, people don’t know. We know the words, and we know how to pronounce them, but nothing in between.
So in fact, today, speech recognition does not need a phonetic transcription. It creates its own phonetic theory, which is quite amazing, right? One of the layers, among the many layers, ends up with something that does not exactly correspond to our phonemes. By the way, linguists fight, right, about what the correct representation of phonemes is, and so on.
Now, the same thing with intents. Intents are a human invention. The invention of intents, especially for a virtual assistant, was very important, even necessary. Why? Because eventually, if you ask a virtual assistant to do something, it has to call what we call an API, the function that provides a functionality. If I say, what’s the weather tomorrow in Sydney?, eventually, once the system understands this, it has to create an HTTP request to the weather.com site, get the response, and interpret the response, right? And we all created an intermediate representation, because it could be weather.com or it could be another site: something like weather(Sydney, tomorrow).
But that creates a scaling problem. The APIs exist already, because the people behind each website create an API so that everyone can query that service. But then, mapping language onto them requires a lot of people, a lot of engineers, defining these intents and their arguments, or entities as we used to call them. Now, if you look at things like Bard, Gemini, and ChatGPT, they don’t have intents. Why? Because they’re able to provide the answer to a question directly, without going through an intermediate phase.
Now the problem is, if I ask what’s the weather in Sydney, they don’t know, because they were trained three months ago, six months ago; they may have known what the weather in Sydney was then, but that’s of no use to us, right? So somehow they have to know the API, what we call the API or the function call, and how to invoke it. And I see we are moving in a direction where, with GPTs and agents and so on, we can do that by giving the chatbot or the large language model knowledge of the API, without designing and engineering an intermediate representation. So I think there is hope, and the world is moving so fast right now that many of the things we do today may soon not require the design of an intent-and-argument schema.
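To make the contrast concrete, here is a hedged Python sketch of the two approaches. The intent frame mirrors Roberto’s weather(Sydney, tomorrow) example; the function description is purely illustrative (the name get_weather and the schema shape are assumptions for this sketch, not any particular vendor’s function-calling API):

```python
from dataclasses import dataclass, field

# Classic pipeline: engineers hand-design an intent schema, and an NLU
# module must map every possible phrasing of the question onto it.
@dataclass
class Intent:
    name: str                                  # e.g. "weather"
    arguments: dict = field(default_factory=dict)

# "What's the weather tomorrow in Sydney?" -> weather(Sydney, tomorrow)
classic = Intent(name="weather",
                 arguments={"city": "Sydney", "date": "tomorrow"})

# Function-calling style: instead of designing intents, we describe the
# existing API to the large language model and let it emit the call itself.
weather_tool = {
    "name": "get_weather",                     # hypothetical function name
    "description": "Get a weather forecast for a city and date.",
    "parameters": {
        "city": {"type": "string"},
        "date": {"type": "string", "description": "e.g. 'tomorrow'"},
    },
}

# Given the user's question plus `weather_tool`, the model would return
# something like {"name": "get_weather",
#                 "arguments": {"city": "Sydney", "date": "tomorrow"}},
# which our own code then executes against the actual weather service.
print(classic)
```

The engineering effort shifts from designing and maintaining an intent-and-argument schema for every capability to simply describing the existing APIs to the model.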
Can the human in the loop be taken out?
PHIL:
Yeah. And therefore the human in the loop can be taken out?
ROBERTO:
No, the human in the loop, as I said in one of my previous answers, is still needed to validate. You mentioned hallucinations. You mentioned safety. Safety is an important problem, what we call the alignment problem: aligning the values of AI with the values of humans, like safety, bias, fairness, and so on, right? We don’t want a chatbot like ChatGPT to be abusive, to use abusive language. And we don’t want them to give advice on how to do violent things, to build bombs or to kill people, right? Or simply advice on health, right? You ask, I have a headache, what should I do? They should not answer; they should tell you, go to a doctor. They are not allowed to give you health advice, I think, right? But sometimes they do.
So all these things, which go under the umbrella of responsible AI and safety and trust, still require some form of human in the loop. And as I said, we could have quality assurance loops where humans interact and say, oh, this doesn’t seem to be safe, it’s an abusive answer, I’ll mark it as abusive. And I believe, and I have probably seen some articles where we are starting to do that automatically, that we could imagine a much more expert language model, expert exactly on safety issues, that could actually correct and teach the other model: this is not a good thing to do. But this is still a big, open research issue.
Can we predict the future of AI?
PHIL:
Okay, well, I have just one last question. And that question is, if you were running the interview and not me, what is the question that you would ask yourself? Is there something really important that I’ve forgotten to ask?
ROBERTO:
Ah, that’s interesting. You asked exactly the questions that I would ask myself, probably because you read my books, thank you for reading my books, and you got to know me, my interests, and the points I was trying to make there. I don’t have a specific question to ask myself. Probably the question is: what will happen in the next five years? And the answer is the one some great scientists gave: making predictions is very hard, especially about the future. I mean, who could have predicted ChatGPT 20 years ago? No one did, right?
PHIL:
My wife and I often sit down and talk about whether we could have predicted where we are today if we had looked just two years ago. And usually the answer is that, across a lot of big things, there is something that we would never have imagined, places our lives have gone.
ROBERTO:
It’s interesting: we had a New Year’s party with some friends here, and I ran a game that I’ve done at other parties. I said, let’s predict what will happen by the end of 2024, and let’s meet again at the end of 2024. I actually wrote down the predictions. They were about politics, about AI and all this stuff, the economy and things like that. We’ll see how good they are. But who could have predicted COVID? And, as Mel Brooks would say, who could expect the Spanish Inquisition? No one expected that.
PHIL:
Well, Roberto, I can’t thank you enough for your time and your patient and insightful responses to the questions. It was really great to spend this time together and to dig a little bit deeper into things that I think are really the important topics in our industry today. So a huge thank you.
ROBERTO:
Thank you so much, Phil. I know we had been trying to set a date for quite some time before we actually recorded this; we probably started a couple of years ago, and then I changed jobs and all these other things. But it’s great, I really enjoyed it. Thanks a lot for the great questions. For the next podcast, we’ll do it in front of a good glass of Pinot Noir from some of the great wineries in Australia or in Italy.
PHIL:
They’re both great suggestions and perhaps we can talk about art and photography in the next session.
ROBERTO:
That would be great. Thank you, Phil.
PHIL:
Thanks again Roberto, take care.