I’m trying to make a short film, using AI. I have the script and a strong artistic vision! How hard could this be?
The answer: much harder than I thought. Here are the issues I encountered.
Problem 1: Veo 3 can’t generate good audio
Most AI video models can’t generate audio. Veo 3 can generate audio, but you have very little control over the outputs:
- The spoken dialogue might not follow your script
- You can’t easily control emotions, tone & inflections
- You can’t maintain consistency from one video generation to the next
Prompt:
Mr. GSD cuts a wrestling promo. He's shouting, slowly, and angrily.
# Dialogue
"5 AM - you’re dreaming your sad dreams. I’m hitting personal bests at the gym."
# About the shot
The camera orbits to the side of his body; he continues to look straight, not directly at the camera.
Photo style:
- 2020s, hypermodern, film grain
About the scene:
- Early in the morning, with sunrise barely visible
- Mr. GSD is at a rooftop gym. Dumbbells and various gym equipment is visible in the background. A laptop is visible in the background.
About Mr. GSD:
- Physique: Muscular, highly toned, athletic build.
- Appearance: Short, perfectly trimmed hair and beard, intense gaze with a stern expression.
- Attire: High-performance athletic gear: compression shirt with the phrase "Get Shit Done" prominently displayed, sleek shorts, sports watch, fitness trackers on both wrists, and sneakers.
- Accessories: Headset microphone, futuristic sports sunglasses, a utility belt holding energy bars, stopwatch, and productivity tools.
- Color Scheme: Black, neon green, and silver, evoking intensity, tech-savvy, and productivity.
When it works, it is mightily impressive! And it works great for 8-second videos.
But if you want to make a longer video, with consistent characters, you can’t rely on Veo 3’s audio. It’s not possible to guarantee that your character will sound the same from one video to the next.
You can use another tool like ElevenLabs to generate audio… but this leads to another issue.
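For reference, the ElevenLabs side of that looks roughly like the sketch below. It’s a minimal sketch assuming the public text-to-speech REST endpoint; the API key, voice ID, model, and voice settings are placeholders, not my actual setup.

```python
import requests

API_KEY = "your-elevenlabs-api-key"   # placeholder
VOICE_ID = "your-character-voice-id"  # placeholder: the voice created for the character

def generate_line(text: str, out_path: str) -> None:
    """Send one line of dialogue to ElevenLabs and save the returned audio."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
        json={
            "text": text,
            # Model and settings are assumptions; tune them per character.
            "model_id": "eleven_multilingual_v2",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
        },
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # the endpoint returns raw audio bytes (MP3 by default)

generate_line("5 AM - you're dreaming your sad dreams.", "line_01.mp3")
```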
Problem 2: Syncing audio & video
I want my character’s lips to reflect the actual words being said. If I generate the audio & video separately, this is really hard!
I tried using a video editing tool to manually get the audio & video to sync up. But doing this takes hours for just a 10-second clip, and the results aren’t great.
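The purely mechanical part (dropping the ElevenLabs track onto the Veo clip) can at least be scripted with ffmpeg. A rough sketch with placeholder filenames; note that it only swaps the audio track, which is exactly why the lips still don’t match:

```python
import subprocess

# Mux the ElevenLabs dialogue onto the Veo clip. This replaces the audio track
# but does nothing about the mouth movements on screen, so a lipsync tool is
# still needed afterwards. Filenames are placeholders.
subprocess.run(
    [
        "ffmpeg",
        "-i", "veo_clip.mp4",          # video generated by Veo 3
        "-i", "dialogue.mp3",          # dialogue generated by ElevenLabs
        "-map", "0:v", "-map", "1:a",  # video from the first input, audio from the second
        "-c:v", "copy",                # don't re-encode the video
        "-shortest",                   # stop at the end of the shorter stream
        "clip_with_dialogue.mp4",
    ],
    check=True,
)
```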
I tried using a variety of AI lipsyncing tools: sync.so, vozo.ai and RunwayML’s tools. To use these, I uploaded my video, along with my dialogue generated using ElevenLabs. But the results were poor! I can’t tell why.
I’ve seen very impressive demos for lipsync tech that work well… are there artifacts generated by Veo3 that make lipsyncing hard?
Problem 3: Maintaining consistent characters
Needless to say, I want my characters to be consistent from scene to scene, and shot to shot. This is hard to achieve and takes a lot of trial & error. Here’s what worked for me:
- Using a reference image with Veo 3 helps a lot
- Store prompts in a Google Doc so you can re-use them easily (or keep them as a small template; see the sketch after this list)
- Minimize complexity in a scene. Cut out props & costume elements.
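On the prompt re-use point: a Google Doc works, but the same idea can live in a tiny script so the character description never drifts between shots. This is just a sketch; the wording is trimmed from my prompt above, and build_prompt is a made-up helper, not anything Veo-specific.

```python
# Fixed character block, reused verbatim in every prompt so the description
# never drifts from shot to shot. (Wording trimmed for the example.)
MR_GSD = (
    "About Mr. GSD: Muscular, athletic build. Short, perfectly trimmed hair and beard, "
    "intense gaze, stern expression. Compression shirt reading 'Get Shit Done', sleek "
    "shorts, sports watch, sneakers. Colors: black, neon green, and silver."
)

def build_prompt(action: str, shot: str, scene: str) -> str:
    """Assemble a Veo prompt from per-shot details plus the fixed character block."""
    return f"{action}\n# About the shot\n{shot}\n# About the scene\n{scene}\n{MR_GSD}"

print(build_prompt(
    action="Mr. GSD cuts a wrestling promo, shouting slowly and angrily.",
    shot="The camera orbits to his side; he keeps looking straight ahead.",
    scene="Early morning rooftop gym, sunrise barely visible, dumbbells in the background.",
))
```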
Problem 4: Workflows
I have to jump between 6 different tools to build a simple video. Here’s my current setup:
- Write a script in Google Docs (mostly by hand - ChatGPT is a decent brainstorming partner, though)
- Record the dialogue for the script myself, using any voice recording tool and my MacBook Pro’s default microphone
- Generate a voice for my character using ElevenLabs
- Upload my recording to ElevenLabs, and clone it using the character I generated
- Generate a reference image to represent my character, using Stable Diffusion. Iterate on the character. Store my prompt in a Google Doc so I can re-use it later.
- Upload the image to Veo, via Flow
- Pray that Flow doesn’t tag the image as depicting a public figure, because it will refuse to generate a video if so (this happened to me multiple times)
- Prompt Flow, and generate the videos in 8-second chunks. Iterate on these videos.
- Open all the videos in a video editor (I used iMovie) and stitch them together (a scripted alternative is sketched after this list)
- Add the audio track from ElevenLabs. Somehow, magically, get lipsyncing to work (haven’t figured this out yet)
- Add subtitles as a part of the video (gotta bump up Instagram engagement), using another tool (haven’t figured this out yet)
- Upload & share
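For the stitching step, a scripted alternative to iMovie is ffmpeg’s concat demuxer. A minimal sketch, assuming every 8-second chunk was exported with the same codec, resolution, and frame rate (clip names are placeholders):

```python
import subprocess

clips = ["shot_01.mp4", "shot_02.mp4", "shot_03.mp4"]  # 8-second Veo chunks, in order

# The concat demuxer reads a plain text file listing the inputs.
with open("clips.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

# Stitch without re-encoding; this only works if every clip shares the same
# codec, resolution, and frame rate.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "clips.txt", "-c", "copy", "episode.mp4"],
    check=True,
)
```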
I’m sure this could be made easier somehow - maybe I haven’t found the right setup yet.
Problem 5: Safety guardrails
Veo’s safety guardrails are well-intentioned. But they make it difficult to generate the content I want.
My lead character, Mr. GSD, is (clearly) a tech bro spoof. In his introduction scene, I wanted him to be doing barbell squats while using a laptop tray (à la Nathan Fielder’s laptop harness). And I wanted Mr. GSD to dramatically drop the barbell and grunt loudly. No matter which prompts I tried, I couldn’t get Veo to give me a video of the barbell being dropped dramatically! This is the best I got:
I should caveat this: I don’t know for a fact that the dramatic barbell drop was blocked for safety reasons. But Veo 3 does have strong guardrails against depictions of violence or harm. And safety guardrails were ChatGPT’s best hypothesis for why it didn’t work 🤷🏼‍♂️
Problem 6: Unintentional trademark violations?
Veo 3 generates videos that accidentally incorporate brand trademarks. For instance, if you ask it to generate a wrestler holding a microphone, you get a perfect WWE microphone!
But like - I don’t want this! I don’t want to get sued! Maybe I can prompt my way out of this situation, but it adds to the workflow complexity. I definitely don’t want to manually edit out all of these trademarks that appear accidentally.
Given that YouTube videos are used to train Veo, and WWE is a very prominent publisher of legal, high-quality content on YouTube, maybe WWE videos have found their way into the training data?
Other random tips:
- Be very careful when including text in the video. Avoid it if possible - especially text that needs to remain consistent from scene to scene
- Audio cloning is superior to text-to-speech for generating dialogue. It’s very hard to get inflections & tone correct if you give ElevenLabs plain text to read out. But converting my own voiceover recording into a character voice on ElevenLabs worked really well (roughly the step sketched after this list).
- With that said: the primary audio recording needs to be high quality. If you mutter towards the end of a sentence, so will the voice clone. Audio artifacts will remain, or get worse. So a Bluetooth microphone in a noisy environment won’t work well! Also, you need to be a decent voice actor. If you don’t enunciate well, neither will the voice clone!
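That conversion step is ElevenLabs’ speech-to-speech feature. For completeness, here’s roughly what the call looks like, with the caveat that the endpoint path and the multipart field name are my assumptions about the current API, so double-check the API reference:

```python
import requests

API_KEY = "your-elevenlabs-api-key"   # placeholder
VOICE_ID = "your-character-voice-id"  # the voice my recording gets mapped onto

# Send my own voiceover recording and get it back in the character's voice.
# Endpoint and multipart field name are assumptions; verify against the docs.
with open("my_voiceover.wav", "rb") as f:
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        files={"audio": f},
    )
resp.raise_for_status()
with open("mr_gsd_line.mp3", "wb") as out:
    out.write(resp.content)
```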
The results
… are not very impressive. But hey, ship an embarrassing v1 and all that.
The film is about a professional wrestling company that’s struggling to stay alive. And here’s a scene, showing a promo from the company’s top star, Mr. GSD:
Coming next
As soon as I can figure out how to fix the lipsync problem, I have a lot of individual scenes to share! And hopefully, full episodes soon. If anyone has tips on how I can fix these issues, send me a note on LinkedIn (I know, I’m a total corporate stooge). Vendors welcome - I really want to ship this!