How a Google Engineer Used Her AI Smarts to Create the Ultimate Family Archive
COVID-19 lockdowns perhaps gave a few of you some time to organize old photos that have been languishing on SD cards or in boxes, but how many of you built an AI-powered searchable archive of family videos from almost 500 hours of footage?
Dale Markowitz, an Applied AI Engineer and Developer Advocate at Google, did just that. The Texas-based Princeton grad took hours of disorganized miniDV tape footage housed on Google Drive and turned it into an archive “that let me search my family videos by memories, not timestamps,” she wrote in a July blog post. It was the ultimate Father’s Day gift.
We spoke with Markowitz recently to find out how machine learning helped her get it done, and why AI is only one part of the puzzle when it comes to solving complex problems.
Although this project used a raft of Google tools, which we’ll get to, it was actually not for the day job, but the coolest Father’s Day gift, right?

[DM] At Google, I spend lots of time trying to think up new use cases for AI and build prototypes focused on the more “business-y” side. But I always wanted to work on more fun, zany stuff and, with quarantine, I finally had SO MUCH TIME. So, yes, this one was a gift for my dad—who, by the way, is also a huge programmer nerd who works in machine learning.
As your dad works in machine learning, he would totally get what it took to build it out. Let’s go “under the hood” with the details.

[DM] Sure. So I uploaded all of my dad’s videos to a cloud storage bucket and then analyzed them with the Video Intelligence API, which returns JSON. Basically, the API does all the heavy lifting including: detecting “scene” changes; extracting text and timestamps on screen using computer vision; transcribing audio; tagging objects and scenes in images; and so on.
Because you needed to apply intelligence to what was probably hours of untagged material, right?

[DM] Exactly. When my dad recorded on miniDV, the clips weren’t saved into separate files. They’d all be smashed into one long, three-hour recording, separated by little flashes of black and white. The API was able to pick out where those clip boundaries should have been.
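To make the clip-boundary idea concrete, here is a minimal sketch (not Markowitz’s actual code) of pulling shot boundaries out of the kind of JSON the Video Intelligence API returns. The response shape mirrors the API’s shot annotations (`shot_annotations` with start and end time offsets), but the timestamps below are invented sample data:

```python
# Sketch: turning Video Intelligence shot annotations into clip boundaries.
# The structure mirrors the API's JSON output; the values are invented.
sample_response = {
    "annotation_results": [{
        "shot_annotations": [
            {"start_time_offset": {"seconds": 0},   "end_time_offset": {"seconds": 182}},
            {"start_time_offset": {"seconds": 183}, "end_time_offset": {"seconds": 419}},
            {"start_time_offset": {"seconds": 420}, "end_time_offset": {"seconds": 700}},
        ]
    }]
}

def clip_boundaries(response):
    """Return (start, end) pairs in seconds, one per detected clip."""
    shots = response["annotation_results"][0]["shot_annotations"]
    return [
        (s["start_time_offset"].get("seconds", 0),
         s["end_time_offset"].get("seconds", 0))
        for s in shots
    ]

print(clip_boundaries(sample_response))
# [(0, 182), (183, 419), (420, 700)]
```

With boundaries like these in hand, a tool such as ffmpeg can slice the one long recording into individual clip files.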
Regarding audio transcription, that must have helped in tagging, categorizing, and identifying what was on all those miniDVs.

[DM] Yes, and I found this to be the coolest part of the project, because it let me search for hyper-specific things like “Pokemon” or “Gameboy.” Also, my dad was a big video narrator, so I could search his commentary for milestones.
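Once the transcripts are in JSON, searching them is straightforward. The sketch below uses invented records and plain substring matching, which is the naive baseline before any semantic search layer:

```python
def search_transcripts(records, query):
    """Return (video, start_seconds) for transcript segments containing the query."""
    q = query.lower()
    return [(r["video"], r["start"])
            for r in records
            if q in r["transcript"].lower()]

# Invented sample records, standing in for the API's transcript output.
records = [
    {"video": "1999_birthday.mp4", "start": 12, "transcript": "Look, he got a Gameboy!"},
    {"video": "2000_xmas.mp4",     "start": 95, "transcript": "Opening the Pokemon cards now"},
    {"video": "1998_beach.mp4",    "start": 4,  "transcript": "We're at the beach"},
]

print(search_transcripts(records, "pokemon"))
# [('2000_xmas.mp4', 95)]
```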
As an applied AI engineer, you’re experienced in this field, but others using the API won’t need to be up on machine learning, right? Essentially, it’s almost, but not quite, out-of-the-box in terms of building out the metadata and intelligence?

[DM] Confirmed. You don’t need any ML expertise to build out this project. It’s very developer-friendly. Having said that, there was one more AI part of this project, which was implementing search. I wanted to be able to search through all those transcripts, scene labels, and object labels, but I didn’t want to have to exactly match the words.
Because you needed a proper semantic search layer for this project?

[DM] Exactly. I wanted to allow for near-matches and misspellings and even matching synonyms, such as treating the word “trash” the same as “garbage.” As you know, in “semantic search,” you want an algorithm that understands the semantic meaning of what you’re saying regardless of the specific words and spellings you use. For that, I used a great “Search as a Service” tool called Algolia. I uploaded all my records (as JSON) and Algolia provided me with a smart semantic search endpoint to query those records.
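Algolia handles typo tolerance and synonyms server-side, but the idea of near-matching is easy to sketch locally. The snippet below is a toy stand-in, not Algolia’s API: it uses Python’s standard-library difflib to forgive misspellings. Synonym matching like “trash”/“garbage” goes beyond edit distance and would still need a configured synonym list:

```python
from difflib import get_close_matches

# Invented label vocabulary, standing in for the indexed video annotations.
labels = ["garbage truck", "birthday cake", "swimming pool", "gameboy"]

def fuzzy_match(query, vocabulary, cutoff=0.6):
    """Return vocabulary entries that nearly match a possibly misspelled query."""
    return get_close_matches(query.lower(), vocabulary, n=3, cutoff=cutoff)

print(fuzzy_match("gamebouy", labels))
# ['gameboy']
```

An exact-match search would return nothing for “gamebouy”; edit-distance-style matching recovers the intended label, which is the behavior a hosted search service provides at scale.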
Obviously, you’ve got a corporate account as a Googler to use all these tools. But what would the cost be for a non-Googler to do this? And are you sharing your GitHub code so people can replicate this?

[DM] Yep, the code is all open source. Though I should add that a lot of these features are available through Google Photos, which works with videos too, apart from the ability to search transcripts. Cost-wise, I analyzed 126GB of video (about 36 hours) and my total cost was $300. I know that seems high, but it turns out the bulk of the cost came from one single type of analysis—detecting on-screen text. Everything else amounted to just $80. As on-screen text was the least interesting attribute I extracted, I recommend leaving that out unless you really need it. Also, the first 1,000 minutes of video fall under the Google Cloud free tier. Besides the ML parts, storing my data in Algolia runs me around $50 a month for around 90,000 JSON objects. But I haven’t done much optimizing, and they do have a free tier.
You’re the overall host on YouTube for the new series “Making with Machine Learning.” What’s up next there in terms of projects?

[DM] Machine-generated recipes, automatically dubbed videos, and an AI dash cam. Well, if I can get those things to work—I never really know until I start building them. Another thing I’ve been fascinated with lately is doing machine learning with little or no data, and “zero-shot” learning. More on that coming soon.
We’ll look out for those. Now let’s do some background on you: What drew you to study computer science and why specifically at Princeton?

[DM] I originally decided to go to Princeton because I wanted to be a theoretical physicist, and I really admired Professor Richard Feynman when I was in high school. But back in 2013, when I was a sophomore in college, it really felt like computer science was the place to be: everything was developing so quickly—Arduino, AI, brain-computer interfaces. In retrospect, though I didn’t know it then, majoring in computer science was a great decision, because there’s almost no field, scientific or otherwise, that hasn’t benefited from machine learning. In fact, sometimes it seems like some of the most cutting-edge work in biology and neuroscience and physics is coming from ML.
You worked as a researcher on brain-machine interfaces to measure sustained attention. Can you give us a brief explanation of what you were doing there?

[DM] In that lab, some researchers had discovered they could (roughly) measure attention by having people do an extremely mundane task in an fMRI machine and then analyzing their brain scans. They were actually using deep learning, which was pretty revolutionary in neuroscience at the time. The problem is that fMRI machines are extremely expensive. I was investigating whether you could get similar results using an EEG machine (which is much cheaper), and specifically a portable, wireless EEG (which is much, much cheaper). The results were mixed, but I think, since then, portable EEG machines have gotten better at taking clear readings, and I have gotten better at machine learning.
You moved from data science to applied AI and your focus is on how people can apply AI, ML, etc. But do you also interface with the more theoretical AI people at Google, or only tangentially?

[DM] There is a pretty tight relationship between Google Cloud and Google Research. The field changes so quickly that there has to be. When a splashy research paper comes out, it takes almost no time before customers start asking how to get it on Google Cloud. One good example is around explainability and responsible AI. Now that machine learning is becoming more accessible, more folks can build their own models. But how do you know those models are accurate? How do you know you can trust them, and that they won’t make predictions that are embarrassing or offensive? The answer is closely linked to explainability, our ability to understand why models make the predictions they do—i.e. it’s hard to trust “black box” models.
Yeah, there’s a big push for explainable AI right now.

[DM] This is a tough problem, and an active area of research across Google. But we’ve been working very closely with Google Research to add explainability into our customer-facing products.
At Google I/O 2019, you focused on democratizing AI—allowing developers to use Google’s AI tools, like AutoML, and off-the-shelf APIs to create cool stuff. Tell us more about that.

[DM] ML has gotten way easier and more accessible for developers over the past five years. And one of the reasons that’s so exciting is because more people from different backgrounds start using it and we end up with very creative projects. Sometimes people see a project I’ve built and they’ll riff on it, which I think is super cool. For example, I built a tennis serve analyzer, and then some folks built a cricket and a badminton version. I saw a yoga pose detector, and someone built an AI Diary using some of the same tech as my video archive analyzer.
Thinking more broadly, it occurred to me that many of your AI-powered projects are applications that could help non-neurotypical people to navigate the world. For example, you engineered an AI Stylist which could illuminate social cues and help people be workplace-appropriate or situation-appropriate.

[DM] Interesting. On one hand, there are definitely great applications of AI for non-neurotypical folks. The most compelling one I’ve heard of involves using computer vision to understand facial expressions and emotions. On the flip side, I try to avoid using machine learning in situations where the result of a mistake is catastrophic.
On that note, when I interviewed Dr. Janelle Shane, she had some bizarre brownie recipes generated by one of her AIs, because that stuff is harder than most people imagine. For example, AI doesn’t have “common sense,” so you had to build in rules that a human wouldn’t need – i.e. “I need two shoes, a left one and a right one, but only one shirt or hat.” Any wardrobe mishaps with the stylist before it got it right?

[DM] Oh yes, 100%. Furthermore, I would say using a combination of ML and human rules is a pretty good design pattern. One mistake I see people make a lot is trying to completely, end-to-end solve a problem with AI. It’s better to use ML only for the parts of your system that really need it, such as recognizing a clothing item from an image, and then write simple rules in places where ML isn’t necessary—such as combining clothing items to make an outfit. Human rules—i.e. “An outfit contains exactly two shoes”—are usually easier to understand, debug, and maintain than ML models. One thing that seemed to trip up the stylist app was that I took a bunch of pictures of clothing on mannequins; my vision model was trained on pictures of people, not mannequins.
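That ML-plus-rules split is easy to illustrate. In the sketch below, a hypothetical vision model has already tagged the clothing items; plain human-written rules (the kind Markowitz describes) then decide whether the tags form a valid outfit:

```python
from collections import Counter

def is_valid_outfit(items):
    """Human-written rules: exactly two shoes, at most one shirt and one hat.

    `items` stands in for the tags a vision model would emit; no ML needed here.
    """
    counts = Counter(items)
    return (counts["shoe"] == 2
            and counts["shirt"] <= 1
            and counts["hat"] <= 1)

print(is_valid_outfit(["shoe", "shoe", "shirt"]))          # True
print(is_valid_outfit(["shoe", "shirt", "shirt", "hat"]))  # False
```

The rule layer is trivially debuggable: if the stylist suggests three shoes, you read one function, not a model’s weights. That is the maintainability argument for keeping ML out of the parts of the system that don’t need it.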
The vision model which was looking for humans, not static clothes horses?

[DM] Yup. That really tricked the model. It was convinced the mannequin was a suitcase or something. By the way, I published the code on GitHub if others want to try it out.
At Google I/O, you also talked about custom sentiment analysis using natural language. Has that been deployed into something cool like a concurrent translator that can detect irony or emotion—i.e. good for non-native speakers while on business trips abroad—if we ever get to do those again?

[DM] Interesting idea. We’re still struggling with irony detection in NLP. But can you really blame a computer for not recognizing irony when lots of humans can’t, either?
Good point.

[DM] I also suspect irony is largely contextual—i.e. text paired with an image, or spoken in a particular way, which makes the problem more challenging. Detecting emotion from speech is a cool idea. But I’d probably opt not to analyze just the words the person is saying (text sentiment) and focus more on their intonation. Sounds like a neat project. But like many ML problems, the challenge is finding a good training dataset.
True. So, wrapping it up, do you see the AI tools that you’re working with now as a way of building a “smart layer” between IRL and our silicon cousins (embodied/non-bodied AIs)? For example, when I interviewed AI researcher Dr. Justin Li, we talked about AI being able to anticipate our needs before we know we have them.

[DM] In the future, yes, I think humans and AIs will work closely together. But for me what’s most compelling are use cases where machine learning models are uniquely well-suited to do something that humans can’t do or aren’t good at. For example, people make really good assistants and companions and teachers, but they’re not very good at processing millions of web pages in seconds or discovering exoplanets or predicting how proteins fold. So it’s in these applications, I believe, that AI can make the most impact.