One CA Podcast

1: Jon May: Artificial Intelligence for HA/DR Operations - LORELEI

41 min • 8 April 2018

Please welcome Jon May, Research Assistant Professor of Computer Science at the University of Southern California.

Dr. May describes his work on a DARPA-funded artificial intelligence project called Low Resource Languages for Emergent Incidents (LORELEI) and its connections with HA/DR operations for Civil Affairs.

One CA is sponsored by the Civil Affairs Association.

Hosted and edited by John McElligott.

---

Transcript

00:01:00    Introduction
and welcome to the One CA Podcast. My name is John McElligott. We're joined today by Jonathan May. He received his PhD in computer science from USC in 2010. Prior to rejoining USC and the Information Sciences Institute in 2014, he was a research scientist at SDL Language Weaver. Jon's research areas include natural language processing, specifically machine translation and semantic parsing, and formal language theory. Dr. May, thank you very much for your time. Thanks very much for having me. It's great to be here. Sir, before we dive into the program that you're working on and how it relates to humanitarian assistance and disaster response and the civil affairs branch of the military, we want to go through some of the basics of what your field entails. So if you could go into more detail about your background and the natural language processing field. Sure, great. I was a computer science major in college, and I started to become very interested in artificial intelligence.

00:02:09    SPEAKER_04
I thought it was really cool that, you know, we could build systems that could, you know, try to mimic the brain, sort of, or play games against humans. And in particular,

00:02:23    SPEAKER_04
I like the idea of, I discovered this field called natural language processing, which is really about how humans and computers can talk to each other, really how computers can understand human language and then produce human language and everything that that entails.

00:02:44    SPEAKER_04
And today you see a lot of natural language processing, or it's also sometimes known as computational linguistics, in your day-to-day life. So if you're just using, say, Google and typing a search query there, you're using your own words to try to figure out what you want,

00:03:00    SPEAKER_04
and then a computer algorithm somewhere is trying to find a web page that's responsive to you. So that's natural language processing right there. Other areas are determining when you spelled a word wrong. A kind of classic example is Siri, who's listening to you speaking,

00:03:19    SPEAKER_04
understanding the speech patterns and turning those into words and understanding what those words are supposed to mean and then trying to give you an answer. Then there's automatic translation, which is, you know, where you've got some Chinese webpage and you want to figure out, you know, what does this mean?

00:03:36    SPEAKER_04
You know, maybe it's a train ticket booking page. You need to figure out how to buy your tickets and they don't have the data. Somebody didn't write a translation, so you have to automatically translate these words. And then you can actually engage in commerce there, even though they don't speak your language and you don't speak theirs. So I love all that stuff. It really is, it seems to me, a great way, particularly translation, to unify the world. So we're all kind of speaking one language together. And yeah, there's lots of great accomplishments that have happened over the past 20 years or so. And I think there's a lot more still to be done. It seems to be a field that's advancing at a rapid pace right now. Yes, yes. In particular, the field has really been around about as long as computers have been around. Pretty much, you know, the early computers that were developed at the end of the Second World War were first used for calculating missile trajectories,

00:04:31    SPEAKER_04
but then the second use was trying to do automatic translation. In particular, like in the early 50s, the U.S. was particularly keen, of course, on translating Russian. And this was way back when, but it wasn't very good for a very long time. But in the modern era, we have volumes of data available to us and really sophisticated, fast hardware that's able to process this data. And so we're able to take advantage of all this data and learn statistics about the data, and that has led to lots of gains, really practical gains, in the past, say, five to seven years in particular.

00:05:21    SPEAKER_04
You've probably heard about the advent of deep learning, which is the use of this particular kind of technology called neural networks. And they have really led to some really stunning developments. Now, sometimes it can be hard to tell whether you're talking to a computer or a human. Wow. And so it's fascinating. And I wanted to ask you about a question that was included in a brief that you had provided to some civil affairs troops recently. The question was, can we leverage artificial intelligence or AI to respond to disasters around the world? What inspired you to ask that question? I want to give credit to DARPA for really asking that question before I did. But I saw,

00:06:07    SPEAKER_04
well, I think they saw, and we all saw it together. I was working for this machine translation company after graduation in 2010.

00:06:16    SPEAKER_04
And I remember, so this was a company and we were providing translation for many different kinds of languages to companies and also for some government projects and also to help human translators actually do their job better. And I remember there was the earthquake, I believe, in Haiti. And it was a big humanitarian crisis. Most of the people in Haiti, of course, speak Haitian Creole, which isn't a language that we've historically spent efforts on trying to build automatic translation systems for. There's not a lot of data. There's not too many people that actually speak Haitian Creole, the population of Haiti,

00:06:56    SPEAKER_04
which is relatively small. But I asked my boss at the time, I said, you know, is there anything that we could do? I feel like maybe we could be of some service. And he said, well, I don't think there's much we could do. I mean, you know, these people are in a crisis situation right now. And it takes us quite a bit of time to gather enough data to build a system. And even building the systems takes some time. And by the time we're ready to deploy a translation system to maybe connect, say, USAID providers with the people on the ground who are maybe texting out their requests, it's going to be too late. So we didn't do anything, but there were people who did. And there was a program where they went down,

00:07:37    SPEAKER_04
and there was a team of people who did what I do. But they also brought in native Haitians, expats, and they were trying their best to use what technology they could and also just kind of scramble to translate these things as fast as possible. But it was kind of like it would have been better if they had prepared this sort of thing ahead of time.

00:08:00    SPEAKER_04
Well, prior to that, we had done, I worked on a team, I think, back in, I want to say, 2003. And we were looking into, you know, if we needed to develop a system in a new language for translation or for, sometimes translation is fine, but you actually typically get lots and lots of data thrown at you all at once. I think analysts can receive, you know, tens of thousands of documents that they have to sift through a day. And just translating them all is not really necessarily going to be that great. There's other techniques that are part of natural language processing, which is understanding the most important parts of a document, trying to provide a summary, or just identify the names of the people, the places, and maybe the events that are happening in a big picture to allow some triage to happen. So we wanted to know, could we build those systems? If we just learned about a language, and somebody said, okay, go, build a system, what could you do in 30 days? And back in 2003, we tried doing this.

00:08:58    SPEAKER_04
And I was really kind of taken by how surprisingly well we were able to do with the language at the time, Cebuano,

00:09:06    SPEAKER_04
which is... Where is that spoken? I think it's in the Pacific, in the Pacific Islands region,

00:09:15    SPEAKER_04
and I should look that up. Give me a second,

00:09:19    SPEAKER_04
if that's all right. Maybe Papua New Guinea or someplace like that? So, I'm sorry. The Philippines. It's spoken in... Yes, it's an Austronesian language, so it's native to the Philippines. It's the second most spoken language in the Philippines after Tagalog.

00:09:40    SPEAKER_04
It should have been fresh around now. But anyway, yes, so it's spoken in the Philippines. But I hadn't studied it before, and most of our team hadn't. And, you know, we did a pretty good job. It was kind of surprising how well we were able to do without too much specific Cebuano data, and we didn't talk to any Cebuano experts. And so this kind of, I think this idea was sort of stirring around, and then after 2010, at DARPA they came out with this program, which was about,

00:10:12    SPEAKER_04
the name of the program was LORELEI, and it was about trying to be responsive to the humanitarian aid and disaster relief needs when you don't have a lot of resources available, in terms of data and in terms of time. So given very limited data in the language that you need to build a system for and given a very limited amount of time,

00:10:34    SPEAKER_04
really, ideally 24 hours is what they're aiming for, what kind of systems can we build? What kind of technology can we build? And so that's been a major focus for me and for a number of researchers actually around the world over the past few years. And it's been great because we really get to work with people who speak the language but aren't experts in linguistics or experts in computer science,

00:10:57    SPEAKER_04
and they teach us about their language in this really limited time frame. And we're able to build surprisingly sophisticated systems. It was surprising to me at first, actually. And, you know, if you have a little more time, you do a little better, but when you don't have a lot of time, you can still do pretty well. I think there's also been some nice interest in deployment in various agencies. So it's been a pretty nice story.

00:11:28    SPEAKER_04
Right. Yeah, I think 24 hours is very fast for anyone, but especially for civil affairs and for the military. Unless we happen to be on the ground or in country already, if there was a natural disaster or outbreak or some kind of man-made event, it would take a little bit longer for most teams to respond. But if USAID or some other assets were already, you know, on their way as a DART team, for example, then we would be coordinating with them, and having a system like this in place would be very helpful. Well, it's really great to hear that 24 hours is a little too fast because, to be honest, if you wait a week, it's a lot better. So, you know, we can do some early triage, but then actually the more we see how we're doing at the beginning, the better our systems can get. So in our early days, we did give ourselves up to a month. And by the time you're done with a month of training, you've actually got a fairly usable system. It's still not at the same level as, say, like a French-English translation system where we've got 100 billions of words of French and English, and we've been studying that problem for years and years.

00:12:47    SPEAKER_04
We do pretty well, and we learn more insights on the language over time, too. So our first year, we were working with Uyghur, which I'm actually kind of pronouncing wrong. I think it's more like Uyghur. But this is a language that's spoken in China, in the Xinjiang region, which is in the northwest. So it's spoken by an ethnic minority. It's a Turkic language, actually. It has no relationship to Mandarin. And it's, you know, so we were working with Uyghur and we realized after a few days,

00:13:22    SPEAKER_04
maybe a week of working with it, that hey, you know, this language is actually quite similar to some language that we've already got data for. And we had a lot of Uzbek data. And so we were able to develop techniques for pretending that the Uzbek was Uyghur and actually transforming the Uzbek into Uyghur. And that increases the amount of data that you've got available. And this is kind of a major part of this program, is trying to look around and see, you know, even though you don't have a lot of resources in the language that you care about, if you have a lot of resources in other related languages, and you can figure out what those related languages are, can you leverage those? Right. And furthermore, you know, there's, to some degree, all languages have things in common, right? So even though...

00:14:09    SPEAKER_04
Chinese and English might seem very, very far apart from each other, and in many ways they are, there's still kind of common understandings that underlie all languages, and you can take advantage of these things too. So there's kind of like language-universal ideas.

00:14:25    SPEAKER_04
So if you have a bunch of news data, say, and it's in some language, you don't know this language at all. Maybe you're not even told what the language is. You can still assume that people are probably going to be talking at some point about dates, right? You know, days of the week or months or years. Right. And, you know, we do tend to segment our...

00:14:51    SPEAKER_04
calendar into, you know, roughly four-week chunks. And so there's, you know, about between 28 and 31 days in every month. And so you can kind of pick up on these common regularitie