Q&A: Microsoft Worldwide M&E Strategy Director Simon Crownshaw Talks Gen AI


In this expansive interview with Simon Crownshaw, Microsoft’s worldwide media and entertainment strategy director, we discuss how Microsoft customers are leveraging generative AI at all stages of the streaming workflow, from content delivery to enhanced user experiences across a range of use cases. Crownshaw also digs deep into how Microsoft is building asset management architecture and the critical role metadata plays in making large language models (LLMs) effective and maximizing the value of available data.

Nadine Krefetz: How are your customers talking about generative AI, and what use cases are they starting with?

Simon Crownshaw: Asset management, user experience, video delivery, and compression. There is a common thread around why all of these models have to work together and why our customers are thinking about not just one model, but many.

Krefetz: Why is metadata so important?

Crownshaw: Today, most people can’t find anything. We need to do a lot more automated content retrieval, whether it’s for news or live content on streaming platforms, to find everything a little faster. If you look at the work that we did with NBCUniversal and Comcast around the Olympics with the Al Michaels voiceover, for example, bringing all of those assets into a Cosmos database with the right tags allowed viewers to quickly do a natural language search to find something.

When you bring in the models and the cognitive services to overlay the voice, it’s much easier if all of that metadata is consistently organized against a data model that is able to work in tandem with the large language model. When you’re having to pull it from some random Excel files or inconsistent metadata that might exist with all content, it’s very difficult to do.
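
To make that concrete, here is a minimal sketch of the kind of consistently tagged metadata document and tag-based lookup Crownshaw describes, using the Azure Cosmos DB Python SDK. The container name, document fields, and tags are illustrative assumptions, not the actual Olympics schema.

```python
# Illustrative only: the container name, fields, and tags are assumptions,
# not the actual schema used for the Olympics work described above.
from azure.cosmos import CosmosClient

COSMOS_URL = "https://<account>.documents.azure.com:443/"   # placeholder
COSMOS_KEY = "<key>"                                        # placeholder

client = CosmosClient(COSMOS_URL, credential=COSMOS_KEY)
container = client.get_database_client("media").get_container_client("assets")

# A consistently tagged asset document that a natural language search can resolve to
asset = {
    "id": "oly-2024-100m-final",
    "title": "Men's 100m Final",
    "event": "Athletics",
    "tags": ["sprint", "final", "al michaels", "highlight"],
    "commentary_voice": "synthetic",
    "duration_sec": 94,
}
container.upsert_item(asset)

# A natural language request ("show me the Al Michaels highlights") reduces to a
# simple tag lookup once the metadata is organized against one consistent model.
results = container.query_items(
    query="SELECT c.id, c.title FROM c WHERE ARRAY_CONTAINS(c.tags, @tag)",
    parameters=[{"name": "@tag", "value": "al michaels"}],
    enable_cross_partition_query=True,
)
for item in results:
    print(item["id"], "-", item["title"])
```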

Krefetz: Can you describe some of the asset management architecture?

Crownshaw: We think about the data ingestion layer in terms of how we bring in that content—the raw media files, the metadata ingestion. To understand how to associate the media with the right descriptions, we’ll process all of the different elements by taking frames extracted from the video and putting that into text using Apache or TensorFlow. Then the generative AI piece is built on top.

For a scene detection model, you want to understand how to recognize and categorize different types of scenes. You might do a sequence-to-sequence type piece, which is like a GPT transformer, to understand how to put those concise text summaries, subtitles, or scripts in the right place. Typically, we would do that through some deep learning frameworks.

Some of the tools are available on Azure, such as the open source Hugging Face transformers. Some of the video analysis could go through a temporal convolution-based long short-term memory (TCLSTM) network and others to understand that.
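
As a rough illustration of the "frames to text" ingestion step Crownshaw outlines, the sketch below samples frames from a video and captions them with an open source Hugging Face model. The specific captioning model and sampling interval are illustrative choices, not the exact pipeline he describes.

```python
# Minimal sketch of the "frames to text" ingestion step. The captioning model
# and sampling interval are illustrative choices, not a production recipe.
import cv2
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def frames_to_text(video_path: str, every_n_seconds: int = 10) -> list[dict]:
    """Sample one frame every N seconds and caption it for downstream metadata."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    step = int(fps * every_n_seconds)
    records, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # OpenCV returns BGR; the captioning model expects RGB
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            caption = captioner(Image.fromarray(rgb))[0]["generated_text"]
            records.append({"timestamp_sec": index / fps, "caption": caption})
        index += 1
    cap.release()
    return records

# These captions become the text that scene detection and summarization models build on.
```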

Krefetz: Is metadata finally getting its moment in the spotlight?

Crownshaw: The vast majority of the decision makers I have met with are choosing to fix their data. More than half understand the need to make their data better, because in the long term, it’s going to help them, whether they build their own models or leverage someone else’s.

If you have really bad data, you’re going to ask it to compute more. The more efficient you can make it, the more you can reduce the compute power required. You’re also going to remove some of the complexities you’re asking generative AI to solve.

Krefetz: How do you define a large language model?

Crownshaw: A large language model is a type of artificial intelligence designed to understand and generate human-like text, based on patterns it has learned from a vast amount of text data. Typically, LLMs use deep learning techniques like transformer architectures to process and generate language.

The three key elements of an LLM are scale, pre-training, and contextual understanding.

An LLM is trained on a massive amount of data with a diverse range of topics and styles, which allows you to generate lots of different answers. Most of the models have gone through some sort of pre-training or fine-tuning to improve performance. And contextual understanding means they can understand and generate text in a way that makes sense to anybody who’s looking at it.

Krefetz: If you don’t go through these training stages properly or have the data model set up correctly, then how is generative AI going to behave?

Crownshaw: It’s going to give you some very random results. A lot of the work we do with customers is the foundational data model work [to provide a] structure to use for audio, video, text, and images as you pull those multimodal things back together [so it knows what to do with] a list of characters, scenes, and scripts.

We start off by restricting the data we use to make it pull consistently from the same model. As people learn how to use the different types of prompts or data behind the scenes, they’re able to mitigate those hallucinations that may or may not occur.
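
Here is a minimal sketch of what that kind of foundational data model might look like in practice: structured records for scenes and characters, plus a helper that restricts the model's context to that trusted data. All class and field names are illustrative assumptions, not Microsoft's actual schema.

```python
# A sketch of the foundational data model idea: the model is only given
# structured, known-good context, which keeps its answers grounded.
# All class and field names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Character:
    name: str
    actor: str

@dataclass
class Scene:
    scene_id: str
    summary: str
    characters: list[str] = field(default_factory=list)
    script_excerpt: str = ""

def build_grounded_context(scenes: list[Scene], question: str) -> str:
    """Assemble a prompt that restricts the model to the structured data we trust."""
    lines = ["Answer using ONLY the scene data below. If the answer is not there, say so.", ""]
    for s in scenes:
        lines.append(f"[{s.scene_id}] {s.summary} (characters: {', '.join(s.characters)})")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

scenes = [
    Scene("S01", "Opening chase through the harbor", ["Rivera", "Okafor"]),
    Scene("S02", "Rivera confronts the informant", ["Rivera"]),
]
print(build_grounded_context(scenes, "Which scenes feature Rivera?"))
```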

Krefetz: Is there a standardized approach to data models, or does it depend on the company?

Crownshaw: I’ve never seen a standard data model that works for all. I’ve seen many customers approach it differently. My last company, Disney, had what they would call a “mapping exercise” that would go through many elements of all of the different types of video or content being created and map it through the process. [This would include] everything from which camera it came from, to who’s in it, which scene, and so on, and how they would lay that all out from a data model perspective. But I’ve always seen that as being a very bespoke thing.

Because you’re dealing with a lot of old assets that have grown organically over time, that data needs to be synthesized in a way that the large language models can understand. What you’re seeing is the neglect that this data has probably had over the past 20 or 30 years because of the need to quickly get content out. It’s now catching up with them, [and there’s a] need to go back and refine data so we can use these models effectively.
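
To illustrate what such a mapping exercise might look like, the sketch below normalizes legacy records with inconsistent field names onto one canonical schema an LLM can reason over. The field names and mappings are hypothetical.

```python
# A sketch of a "mapping exercise": legacy records with inconsistent field names
# are normalized onto one canonical schema. The field names and mappings are hypothetical.
FIELD_MAP = {
    "cam": "camera", "camera_id": "camera",
    "talent": "people", "cast": "people",
    "scene_no": "scene", "sceneNumber": "scene",
}

def normalize(legacy: dict, source_system: str) -> dict:
    """Map one legacy asset record onto the canonical schema, keeping provenance."""
    record = {"source_system": source_system}
    for key, value in legacy.items():
        canonical = FIELD_MAP.get(key)
        if canonical:
            record[canonical] = value
    # Anything unmapped is kept aside rather than silently dropped
    record["unmapped"] = {k: v for k, v in legacy.items() if k not in FIELD_MAP}
    return record

print(normalize({"cam": "A7", "talent": ["J. Doe"], "scene_no": 12, "lens": "35mm"}, "legacy-mam"))
```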

Krefetz: Can you talk about some other asset management considerations?

Crownshaw: Asset management obviously comes with high computational cost. Processing and generating summaries of high-resolution video require significant resources, so architecting those in a way where I can ingest content effectively and quickly is important.

Using generative AI to look at the most relevant content to ingest and where to tag and use it is critical. You really need deployment and scaling services. Then you need monitoring, logging, and analytics as well.

As you gather more data in that process, you’re going to need analytics to gain insights into how those models are performing and how people are engaging with the content. It could be Microsoft Azure Synapse that helps you understand what’s going on with the data or maybe even Microsoft Fabric, where we bring multiple data sources together.


Krefetz: How does having content in multiple clouds impact generative AI use cases?

Crownshaw: More and more media customers are using a hybrid approach, combining more than one cloud with on-prem content pieces. They need to use services that bring all of that together for more real-time access to that data.

A customer could have content in AWS and Azure, but at some point, I need to know how all of that is working, because that helps me understand the workflow and the pipeline of how it is being delivered. For this, Microsoft provides Azure Arc, which enables you to pull content and connect other pieces from other clouds.


We also have Microsoft Fabric, which enables you to query data in AWS directly and have it all in one place. You need that ability because it’s hard to move data, to make all of those infrastructure and database changes. As those large language models get used more for things like asset management, we’re going to need access to all of that data, even if it’s not in the same cloud as something else.


Krefetz: How are your customers using generative AI in content delivery?

Crownshaw: On the content distribution side, our encoding partners, Harmonic and MediaKind, are working with some of our customers like FIFA, for example, using generative AI to dynamically adjust streaming quality and bitrate in real time to ensure a smooth streaming experience.

[In the Middle East,] we’re using generative AI models to translate and localize content into multiple languages in real time to automate dubbing and subtitling or remove words [in accordance with country-specific regulations].

Krefetz: With adaptive bitrate encoding, what are the benefits of using generative AI?

Crownshaw: A generative AI model will analyze a large amount of content and understand scenes with complex visuals or high motion. It can predict the bitrate requirements those factors create, then dynamically adjust based on the content analysis it has already done. If the model detects a scene with high motion, it might recommend increasing the bitrate to maintain visual quality. Conversely, for static scenes, it might suggest lowering the bitrate to save bandwidth. That is not done with typical AI models. This is really new—with generative AI, it’s done dynamically.
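
A toy sketch of that decision logic follows: a model scores each scene's motion and complexity, and the encoding ladder is nudged up or down accordingly. The scoring function, thresholds, and bitrates are made up for illustration, not a description of any partner's product.

```python
# Toy sketch of the decision logic described above: a model scores each scene's
# motion/complexity, and the encoding ladder is adjusted accordingly.
# The scoring function, thresholds, and bitrates are illustrative only.

BASE_LADDER_KBPS = {"1080p": 5000, "720p": 3000, "480p": 1500}

def predict_complexity(scene_features: dict) -> float:
    """Stand-in for a learned model: returns a 0..1 complexity/motion score."""
    return min(1.0, 0.6 * scene_features["motion"] + 0.4 * scene_features["detail"])

def adjust_ladder(scene_features: dict) -> dict:
    score = predict_complexity(scene_features)
    if score > 0.7:          # high motion: spend more bits to hold quality
        factor = 1.25
    elif score < 0.3:        # static scene: save bandwidth
        factor = 0.8
    else:
        factor = 1.0
    return {rung: int(kbps * factor) for rung, kbps in BASE_LADDER_KBPS.items()}

print(adjust_ladder({"motion": 0.9, "detail": 0.7}))   # e.g., a sprint finish
print(adjust_ladder({"motion": 0.1, "detail": 0.2}))   # e.g., a static interview shot
```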

For personalized experiences, the model might adjust settings based on individual user preferences or device capabilities and adapt dynamically. It will do network condition monitoring as well. In real time, it can predict network congestion and adjust without anyone having to do anything. The automation starts to kick in to minimize buffering or interruptions.

The last thing it does is predictive caching. Typically, what we have seen in the past is that you would try to understand who [authenticated the request], what [content they want], and where [is closest to cache content]. But now, with generative AI, we can use historical data and machine learning models to predict content demand and pre-cache the content at the appropriate quality levels, which helps smooth the transitions between different bitrate streams.
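
Here is a small sketch of predictive pre-caching in that spirit: forecast demand per title from historical views and pre-position the likely winners at the edge. A simple weighted average stands in for the machine learning model; titles and numbers are invented.

```python
# Sketch of predictive pre-caching: forecast demand per title from historical views
# and pre-position the likely winners at the edge. A weighted average stands in
# for the learned model; titles and view counts are made up.

def forecast_demand(history: dict[str, list[int]]) -> dict[str, float]:
    """Weight recent days more heavily when predicting tomorrow's views."""
    scores = {}
    for title, daily_views in history.items():
        weights = range(1, len(daily_views) + 1)       # older -> newer
        scores[title] = sum(w * v for w, v in zip(weights, daily_views)) / sum(weights)
    return scores

def plan_precache(history: dict[str, list[int]], slots: int = 2) -> list[str]:
    """Pick the titles worth pushing to edge caches before demand arrives."""
    ranked = sorted(forecast_demand(history).items(), key=lambda kv: kv[1], reverse=True)
    return [title for title, _ in ranked[:slots]]

history = {
    "final-highlights": [120, 300, 900],
    "opening-ceremony": [800, 400, 200],
    "press-conference": [50, 60, 40],
}
print(plan_precache(history))   # the titles to pre-cache at the appropriate quality levels
```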

Krefetz: Is it more cost effective to do quality-of-service monitoring/delivery than metadata creation?

Crownshaw: We’re probably at a point now where we don’t know the answer. I think the jury’s still out on whether one is cheaper than the other. What we see is that each is going to provide critical gains and opportunities in terms of how they work going forward.

Krefetz: Let’s talk about streaming content recommendations and the end-user experience and how they’re impacted by the transition from traditional AI to generative AI.

Crownshaw: Traditional AI has been used for a long time now. Generative AI is different. With generative AI, architecturally, we have:

  • A massive data collection layer that understands how we interact with the content
  • The metadata that gathers information about the genre, the actors, the tags, and all the rest of it
  • Feedback data, which captures whether you and I like it
  • Some sort of data processing layer to understand what happens with all of the content on delivery

How do we aggregate all of those pieces and apply the model on top to provide recommendations? We did some of this during the Olympics, producing personalized elements. You have to make sure the application layer is there to deliver those personalized experiences, and you’re going to need a massive infrastructure layer (through cloud services like Azure, for example) to deploy the storage, the compute, and the AI models to make all of that come together.

When a user watches and rates content, we might see the interaction data that was collected. Once we process and analyze it, the data is clean, all of the different features are extracted, and the relevant data is fed into the generative AI model.

Then, we’re going to use the model to create a recommendation engine for new content while dynamic thumbnails and interactive content are generated. The user sees that personal recommendation and dynamic thumbnail, and the feedback loop is complete when the user interacts with it; that interaction is then monitored continuously to improve the AI model.

That’s an example of how that generative AI flow could technically work. We might use Cosmos DB to do data collection, Synapse for the processing, and Azure OpenAI for the model itself. For the application layer, we use Azure Front Door, which can do all the load balancing and content delivery. We use Azure Monitor Log Analytics to collect and analyze that data, and then we might layer that all into a Kubernetes service on Azure to manage and scale those containerized applications, including all of the models together so it runs in a streamlined architecture.
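
As a minimal sketch of the recommendation step in that flow, the snippet below calls an Azure OpenAI deployment with the viewer's interaction data. The endpoint, deployment name, and prompt shape are assumptions, and the Cosmos DB collection, Front Door, and monitoring pieces around it are omitted.

```python
# Minimal sketch of the recommendation step, using the Azure OpenAI Python SDK.
# The endpoint, deployment name, and prompt shape are assumptions; data collection,
# delivery, and monitoring layers are omitted.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def recommend(user_history: list[dict], catalog: list[dict]) -> str:
    """Ask the model for recommendations grounded in the viewer's interaction data."""
    prompt = (
        "Given this viewing history and catalog metadata, suggest three titles "
        "and one sentence on why each fits.\n"
        f"History: {user_history}\nCatalog: {catalog}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # the Azure deployment name is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

history = [{"title": "Harbor Chase", "rating": 5}, {"title": "Cold Open", "rating": 3}]
catalog = [{"title": "Night Run", "genre": "thriller"}, {"title": "Quiet Water", "genre": "drama"}]
print(recommend(history, catalog))
```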

Krefetz: Is it easier to show ROI for some generative AI use cases than others?

Crownshaw: For the NBA, we use a lot of their content to create custom highlights. Not only were we able to build that service, but we were able to build it faster. What might have taken 2 months took us 2 weeks or less. We saw an exponential increase in the number of views that went from a quarter of a million to over a billion. People could interact faster. The cost of doing that was exponentially less, but the return they were seeing was 3.5–4 times.

One area that is important and relatively straightforward is taking advantage of the metadata for archives that media companies have and making sure that they can be streamlined and delivered to customers really fast.

Now we’re having a similar conversation to the one we were having around the cloud about 10-plus years ago: “Is it cheaper to move my content and whole media processes to take advantage of the scale that the cloud can provide?” Today, we’re talking about the return on investment of using generative AI for new pipeline and workflow opportunities.

The answer is, I need to do projects that are relatively quick to show return on investment. I can scale and refine the architecture (because there’s never been enough efficiency in those pipeline opportunities or processes in the past few years) to make sure that content delivers as well as it possibly can. Foundational work is being done on data models to make sure we can service the right content at the right time.

Krefetz: What’s the time frame for other projects?

Crownshaw: There are immediate pipeline opportunities across major studios and broadcasters that have a laundry list of use cases that they want to run, including the adaptive bitrate pieces we talked about previously. I would say we are down to days and weeks, and it’s happening significantly faster.

Krefetz: What about security?

Crownshaw: From a security standpoint, as we’re preparing the stream and getting the CDN components ready, we’ve got to think about the security elements. For a lot of studios and streaming platforms, the security and compliance pieces are going to be front and center so they can understand what’s actually happening with the content and how it is being secured.

At some point, we need to have model improvement and to have security throughout the entire process, not just at the end. How do I make sure it’s encrypted in the right way? Which platform did it end up on? Where did it come from? I would imagine that generative AI will be used significantly more across that whole process, making it much more automated than it was in the past.

Krefetz: What are your thoughts on large language model co-existence? Will they be competitive or cooperative?

Crownshaw: I think what we’re going to see is more models having to work together. There’s not going to be one model that does it all; there are going to be many different types of models. Some models handle data, some handle text, some handle images, and some handle audio. They all need to be integrated for richer outputs in terms of how they work.

All of the models are going to get better based on the feedback loops, the retraining that they get, and the data they get access to. That’s going to be an important process for generative AI going forward.

Krefetz: Any closing thoughts?

Crownshaw: I get asked a lot of questions by customers who are thinking about what their pipeline architecture is going to look like. These are the stages to look at for pipeline architecture (a rough sketch of how they fit together follows the list):

  • Understanding the raw data input and what those model stages might look like
  • Intermediate processing, which consists of outputs from those initial generations that need to be refined
  • The feedback loops, to figure out which models work well together and which ones don’t
  • Cross-model integration, maybe in clean rooms, where different models will generate different outputs
  • A meta model or aggregation mechanism for synthesizing the final result down to where it’s easy to work through
  • Finally, the large computational framework that fits around it. For streaming, there are large-scale generative models that require lots of distributed computing environments leveraging GPUs and TPUs or cloud-based platforms like Azure to manage those workloads and enable efficient model training and inference.
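
The sketch below shows how those stages might hang together: specialized models each produce an intermediate output, a feedback filter drops weak results, and an aggregation step synthesizes the final answer. Every model call here is a placeholder function, not a real service.

```python
# A sketch of how those pipeline stages might fit together. Every model call is a
# placeholder standing in for a specialized text, image, or audio model.
from typing import Callable

def text_model(asset: dict) -> dict:
    return {"kind": "summary", "value": f"Summary of {asset['id']}", "confidence": 0.9}

def image_model(asset: dict) -> dict:
    return {"kind": "thumbnail_tags", "value": ["crowd", "stadium"], "confidence": 0.7}

def audio_model(asset: dict) -> dict:
    return {"kind": "transcript", "value": "partial transcript", "confidence": 0.4}

def run_pipeline(asset: dict, models: list[Callable[[dict], dict]], min_conf: float = 0.5) -> dict:
    # 1. Raw data input goes through each specialized model
    outputs = [m(asset) for m in models]
    # 2. Intermediate processing / feedback loop: keep only confident outputs
    kept = [o for o in outputs if o["confidence"] >= min_conf]
    # 3. Aggregation ("meta model") step: synthesize one result for downstream use
    return {
        "asset_id": asset["id"],
        "synthesis": {o["kind"]: o["value"] for o in kept},
        "dropped": [o["kind"] for o in outputs if o not in kept],
    }

print(run_pipeline({"id": "match-42"}, [text_model, image_model, audio_model]))
```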

I think the independent solution vendors are going to end up having their own little LLMs. You can call those models through different APIs to refine different processes. But I think it’s going to be an interesting next few months in the industry as those models start to come out, and there will be more API calls to make those models available for different processes.
