AI Video Generation Models Compared


In March 2025, I wrote an article on the best AI video generators, and now, in January 2026, I’m publishing this updated version comparing 10 different AI video generation models.

For this AI video comparison test, I fed the exact same text prompt into 10 different AI video generation models to see how each one interpreted the request.

Advancements in LLM technology have made this comparison-generation easier than ever.
The March 2025 article took me about 5 or 6 hours of manual labor.
This January 2026 article took only about 2 hours, because I used Script-Bot Creator on Poe.
Click here to see how I made the whole video comparison, and most of this article, in about 8 prompts over the span of an hour.

The Prompt

Here’s the text prompt I fed into every AI video model:

“Pullback shot depicting a guitar being strummed by a grunge-rock musician, standing in front of the reflecting pond at Balboa Park in San Diego, California.”

I chose this prompt specifically because I’ve actually visited Balboa Park in real life, so I know what it should look like, which helps me judge how accurate these AI video generators are.

The Models Tested

  • Sora-2 and Sora-2-Pro (OpenAI)
  • Veo-v3.1 (Google)
  • Kling-2.6-Pro and Kling-2.1-Master (Kuaishou)
  • Wan-2.6 and Wan-2.5 (Alibaba)
  • Hailuo-02 and Hailuo-Director-01 (MiniMax)
  • LTX-2-Pro (Lightricks)

Note: This represents a SINGLE output on a single text-to-video prompt. It’s hardly representative of the whole model in general—different prompts might yield very different results.

AI Video Generation Models Compared

Sora-2 (OpenAI) 🏆

Sora-2 review - Winner for prompt accuracy

Duration: 12s | Audio: Yes | Gen Time: 166.6s | Cost: $1.08

I would say Sora-2 actually produced a better output than any of the others, which surprised me. It looks the most like what I asked for, even if the results are not as pretty as Kling’s. The California Tower is actually visible in the background (though not in the accurate position relative to the Botanical Building), but it captured the vibe.

This guy is singing a coherent song that almost makes sense. If I passively saw this on YouTube, I wouldn’t necessarily think twice.

Veo-v3.1 (Google)

Veo-v3.1 review - Expensive but wrong location

Duration: 8s | Audio: Yes | Gen Time: 80.1s | Cost: $3.20

The result looks incredibly convincing as a video of a person playing guitar in an outdoor setting, but it has nothing to do with Balboa Park. If I had to guess, I’d say this is a golf course or a college campus.

On the technical side, the guy isn’t singing, which I like. However, this was the most expensive generation of the bunch at roughly $3.20, which proves that a higher price tag doesn’t necessarily mean the model understands geography any better.

Kling-2.6-Pro (Kuaishou)

Kling-2.6-Pro review - Not accurate to Balboa Park

Duration: 10s | Audio: Yes | Gen Time: 152.4s | Cost: $0.70

This is the newest Kling model. It has audio, but the output was not a good use of compute points: it is simply not accurate to Balboa Park.

It feels like a step sideways. The motion is fine, but if I ask for a specific landmark, I expect to see it, and this didn’t deliver the location accuracy I was looking for.

Wan-2.6 (Alibaba)

Wan-2.6 review - Unwanted quick cuts

Duration: 15s | Audio: Yes | Gen Time: 306.5s | Cost: $1.12

This model made some choices I didn’t love. It uses quick cuts, as if it were producing a finished music video, which is not what I asked for. Visually, it’s San Diego-ish in its aesthetics, but closer to Mission Bay than to Balboa Park.

And this guy is speaking, not singing. It also had the longest duration at 15 seconds, but it was the slowest generation of the group, taking over 5 minutes to render.

LTX-2-Pro (Lightricks)

LTX-2-Pro review - Budget friendly

Duration: ~5s | Audio: No | Gen Time: 110.1s | Cost: $0.36

This is my first time trying it. The guy looks like he’s happy to be playing the guitar. The setting looks more like San Francisco, or maybe somewhere like Chicago’s Millennium Park. Nice, looks good, but it’s not great, and it’s not Balboa Park.

There’s no audio, which is fine by me. It was also the lowest cost at just $0.36. If you are on a budget and just need “generic guy with guitar,” this works, but don’t count on it for specifics.

Sora-2-Pro (OpenAI)

Sora-2-Pro review - Made vertical video

Duration: 12s | Audio: Yes | Gen Time: 146.7s | Cost: $1.08

This one made a vertical video, which is not what I asked for. It also has audio, with the guy belting out, “Standing in the sun, I hear the city calling me home.” I feel like it added information beyond my original prompt.

Also, my original “grunge-rock musician” prompt was moderated by OpenAI, so I had to tweak it slightly. It seems the Pro version sometimes tries to be a bit too creative for its own good.

Kling-2.1-Master (Kuaishou)

Kling-2.1-Master - Most pretty but not accurate

Duration: 10s | Audio: No | Gen Time: 183.3s | Cost: $0.90

Personally, I think this is the most realistic and prettiest-looking video. It’s a clean pullback and a high-resolution clip, suitable for a lot of uses. But it looks nothing like Balboa Park, at least not the spot I described in my text prompt.

It has no audio, which I prefer, since AI audio is usually weak. It also produced the largest file size, indicating high-bitrate quality. If you want pretty, this is the one. If you want accurate, look elsewhere.

Wan-2.5 (Alibaba)

Wan-2.5 review - Homogenization of Balboa Park

Duration: 10s | Audio: Yes | Gen Time: 123.7s | Cost: $0.50

This is the version one step before the Wan-2.6 we looked at above. The guy looks fine, but it’s not Balboa Park. It’s more a homogenization of a couple of things from Balboa Park than the real place.

I’m over the AI audio; audio is not a strong suit of video generators right now. It’s a cheaper option than the 2.6 version, but the quality drop-off is noticeable.

Hailuo-02 (MiniMax)

Hailuo-02 review - Looks like Spain

Duration: 6s | Audio: No | Gen Time: 125.9s | Cost: $0.42

Also not Balboa Park. It looks more like Spain, or some sort of university somewhere. It does a decent job with the video animation itself, and the motion is realistic.

No audio here either. It’s very budget-friendly at $0.42, but again, if the location matters to your storytelling, this one missed the mark.

Hailuo-Director-01 (MiniMax)

Hailuo-Director-01 - Walking on water

Duration: 5s | Audio: No | Gen Time: 182.3s | Cost: $0.50

This one got a little crazy because he was walking on the reflecting pond water, so that’s not ideal. The physics just went haywire.

This model is supposed to offer more camera control, but in this instance, it sacrificed basic reality to get the shot. Not a keeper.

Cost Comparison

I tracked the exact cost and generation time for every single video so you can see the breakdown.

| Model | Provider | Duration | Audio | Cost | Gen Time |
|---|---|---|---|---|---|
| Sora-2 🏆 | OpenAI | 12s | Yes | $1.08 | 166.6s |
| Veo-v3.1 | Google | 8s | Yes | $3.20 | 80.1s |
| Kling-2.6-Pro | Kuaishou | 10s | Yes | $0.70 | 152.4s |
| Wan-2.6 | Alibaba | 15s | Yes | $1.12 | 306.5s |
| LTX-2-Pro | Lightricks | ~5s | No | $0.36 | 110.1s |
| Sora-2-Pro | OpenAI | 12s | Yes | $1.08 | 146.7s |
| Kling-2.1-Master | Kuaishou | 10s | No | $0.90 | 183.3s |
| Wan-2.5 | Alibaba | 10s | Yes | $0.50 | 123.7s |
| Hailuo-02 | MiniMax | 6s | No | $0.42 | 125.9s |
| Hailuo-Director-01 | MiniMax | 5s | No | $0.50 | 182.3s |

Total Cost: 328,667 compute points ($9.86) for 10 videos generated across 10 different AI models at maximum settings.
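As a sanity check on those totals, here’s a short Python sketch that sums the per-model costs from the table and derives the implied compute-point rate and cost per second of footage. The prices and point total are taken from this article; the derived rates are just arithmetic from those numbers, not official Poe pricing:

```python
# Per-model (cost in USD, duration in seconds), copied from the table above.
# LTX-2-Pro's "~5s" is treated as 5s for this rough math.
videos = {
    "Sora-2":             (1.08, 12),
    "Veo-v3.1":           (3.20, 8),
    "Kling-2.6-Pro":      (0.70, 10),
    "Wan-2.6":            (1.12, 15),
    "LTX-2-Pro":          (0.36, 5),
    "Sora-2-Pro":         (1.08, 12),
    "Kling-2.1-Master":   (0.90, 10),
    "Wan-2.5":            (0.50, 10),
    "Hailuo-02":          (0.42, 6),
    "Hailuo-Director-01": (0.50, 5),
}

total_cost = round(sum(cost for cost, _ in videos.values()), 2)
total_points = 328_667  # compute points, as reported above

print(f"Total: ${total_cost}")  # matches the article's $9.86
print(f"Implied rate: ~{total_points / total_cost:,.0f} points per dollar")

# Cost per second of generated footage, cheapest first.
for name, (cost, secs) in sorted(videos.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:>18}: ${cost / secs:.3f}/s")
```

Per second of footage, Wan-2.5 works out cheapest (about $0.05/s) while Veo-v3.1 is by far the most expensive (about $0.40/s), which underlines the takeaway below that price and quality aren’t tightly linked.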

Key Takeaways

  1. Prompt accuracy matters more than visual beauty. A pretty video is useless if it’s the wrong location.
  2. AI audio isn’t great yet. When a video has audio, it’s usually weak, and you’d likely replace it in post anyway if you were using the clip in any professional capacity.
  3. More expensive doesn’t mean better. Veo cost me over $3 and got the location completely wrong.
  4. Watch out for unwanted edits. Some models will cut the video or change the aspect ratio without you asking.
  5. Physics can still go haywire. We still have people walking on water.

The Verdict

Winner: Sora-2

For this specific test, Sora-2 wins for prompt accuracy. It was the only one that really gave me Balboa Park.

Maybe that is the real trick: finding that most general video model that can handle most things without extra handholding.

My Recommendations:

  • Use Sora-2 if you need the AI to actually follow your instructions regarding specific locations.
  • Use Kling-2.1-Master if you just want the most visually stunning video and don’t care about the details.
  • Use LTX-2-Pro if you are on a budget and just need a quick clip.

Closing Thoughts

Keep in mind, these are text-to-video results only. Image-to-video is a whole different ballgame—that’s a comparison for another article.

We learned a lot here. Going forward, I’ll be more skeptical when I see AI footage of places I know in real life, because these models often render them differently than they actually look. But who knows? That’s the state of AI video in 2026.

If you want to try these out yourself, you can find them in Poe’s video generation section.


Previously: Best AI Video Generators in Poe AI (March 2025) | Related: What Is The Best AI Image Generator?

This article was created with assistance from Script Bot Creator on Poe, using Gemini 3 Pro for writing.

