July 18, 2025


A Guide on Sentence Mixing

Hello again(?)

CH FR

Hi, I’m CH FR. The article you’re reading here is the revised version of a guide I wrote a few years ago. I went back over some of the wording and made it a bit more interactive.

Intro

When learning to make YTPMV and Otomad audio, some parts will be easier to learn than others. It can depend on your affinities, but also on how much of a given technique is based on concrete theory versus just “vibes”. Some techniques even have both of those aspects at once: while the technicalities of “correcting a sample with VocalShifter” are easy to grasp, decisions like “do I leave the initial pitch slide intact or do I correct it too?” are not that clear-cut.

Sentence mixing kind of feels like it’s almost entirely “guts” and “feeling”. It’s based on lots of tiny considerations and habits that aren’t often put into words, making it all look like some sort of innate skill.

Because it’s such a varied technique, it’s hard to give advice in the form of broad, sweeping statements. As a result, when someone does give advice, it’s usually in a one-on-one, private context. That means there aren’t that many “guides” to Otomad sentence mixing.

However, I think there are parts that can be taught and passed down in a more generic way, and this is what I’ll aim to do with this guide. I believe in the importance of having public learning resources, so that the next generations can get to a higher point than us.

That’s why I’ve written down some of the stuff I picked up over the years. This guide is intended for people who are already familiar with basic audio-making. In it, I will explain the whats and whys of sentence mixing, give a few basic tips and pitfalls to avoid, and finally go over some of the mental categories that I’ve made up based on the structure and intent of the sentencing.

I’ll use the words “sentence mixing” and “sentencing” interchangeably; they’re the same thing.

With that out of the way, let’s get started!

Use cases for sentence mixing: what’s it for anyway?

Here are some of the reasons why you’d want to add sentencing to your audio.

Rhythm

Sentence mixing is another way to add a layer of rhythm to your audios, either from the flow of the syllables, or even as drums.

Uniqueness

More generally, sentence mixing is one of the reasons Otomads are what they are: most videos in Niconico’s all-time rankings have some element of sentencing, and it would be hard to imagine what Otomad or YTPMV would look like without those standout videos.

Humor

Humor is another way to make your work memorable, for example by turning the dialog of your source material into dirty jokes, halting the rhythm in awkward places, etc.

There are also niche methods such as referencing other works or BGM PVs in clever ways, messing with the pitch (vibrato + wave warp to imply that someone is crying), diverting from a fad in an unexpected way, etc.

Lyrics

Having a lyrical section in your audio can really amp up its cool factor, and it makes for a nice surprise if it only comes in later on. You can stick to the original lyrics, change them up, or even make up lyrics to songs that don’t have any.

After all, timing vocals to the beat and pitching them are just one tiny step away!… Just kidding, tuning alone is its own can of worms and you run into all sorts of complications when you ask the simple question “will this pitch well?”. As a result, I will not give tuning tips in this article. However, this doesn’t mean sentence mixing skills aren’t part of the process, which is why I’m still bringing it up here.

The basics

Vocabulary check

Let’s review the basics. These are concepts that can be useful no matter what style you’re going for. We’ll focus on rhythm.

Here’s a quick reminder of some of the terms you’ll see in this article:

BPM: Beats Per Minute (you need to know where or how to find it).
Time signature (same here).
The marker for a beat. Since the time signature is 4/4, you have 4 beats per measure.
The marker for a measure.
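
If you like seeing these terms as numbers, here is a minimal sketch of how BPM and time signature translate into seconds. The function names and the 120 BPM tempo are made up for illustration; they don't come from any song used in this article.

```python
def beat_seconds(bpm: float) -> float:
    """Length of one beat in seconds."""
    return 60.0 / bpm


def measure_seconds(bpm: float, beats_per_measure: int = 4) -> float:
    """Length of one measure in seconds (4 beats per measure in 4/4)."""
    return beats_per_measure * beat_seconds(bpm)


bpm = 120  # example tempo (assumption)
print(beat_seconds(bpm))      # one beat: 0.5
print(measure_seconds(bpm))   # one measure in 4/4: 2.0
print(beat_seconds(bpm) / 4)  # a quarter of a beat: 0.125
```

A quarter of a beat is a handy unit to keep in mind, since it comes up again later when we look at syllable lengths.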

Let’s now move on to some actual tips.

Think with syllables

The first piece of advice I can give is to think in terms of syllables. Just timing by word isn’t enough.

In this particular example, the difference is pretty subtle, which is even worse! It means that if you don’t notice it, you’ll be stuck with sentencing that just “sounds slightly off” without ever knowing why.

Timing by word means operations such as stretching will affect all of the syllables, so if just one syllable is off-time, fixing it will mess up every other syllable.

By the way, this is the song that I am using in this example

This might sound like a lot of extra work if you’ve done it with whole words until now, but it gets easier with enough muscle memory with your DAW. That’s why it’s essential to know and set your shortcuts.

Those are very important, since you need snapping for timing but not for cutting syllables, and you’ll keep switching between the two.

For example, in Reaper:

Shift when dragging or stretching ignores grid snapping.
Alt on the side of an item for stretching.
Alt in the middle of an item for re-timing without moving the item.
Middle mouse click to move the playhead while ignoring grid snapping.

Soft and hard consonants

So let’s say that you have timed your syllables down to a T, but the timing still sounds off.

In that case, soft consonants may be at fault.

I learnt about this concept when watching Mikami’s UTAU tutorial.

Soft consonants are drawn out (m, n, sh, s, …) while hard consonants sound percussive and shorter (t, k, p, b, …).

The problem here is that soft consonants can be pretty long, yet you only really “register” the syllable in the rhythm when the vowel starts. So the consonant is offsetting the “perceived start” of the syllable, making it sound off-time.

In this specific case, we hear the vowel a whole quarter of a beat too late.

The solution? Go off the grid! In your DAW I mean…

Offset your items so that the vowel starts earlier. The waveform is usually enough to guess where the transition between consonant and vowel happens.
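
To put a number on it, here is a small sketch of that offset. It assumes you have measured (or eyeballed on the waveform) how long the consonant lasts before the vowel kicks in. The function name and the example values are hypothetical.

```python
def item_start_for_vowel_on_grid(grid_time: float, consonant_length: float) -> float:
    """Where the item should start so that the vowel, not the consonant, lands on the grid."""
    return grid_time - consonant_length


bpm = 120            # example tempo (assumption)
beat = 60.0 / bpm    # length of one beat in seconds
target = 3 * beat    # the syllable should be heard on the 4th beat
consonant = 0.08     # made-up length of an "s" before the vowel, in seconds

# The item starts slightly before the grid line, by the length of the consonant.
print(item_start_for_vowel_on_grid(target, consonant))
```

In practice you would do this by ear and by eye in your DAW rather than by calculating it; the point is only that the item moves earlier by about the length of the consonant.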

And finally, make sure you don’t overdo it. Soft consonants are not always that long.

If it sounds fine from the start, then that’s fine! You don’t have to mechanically apply those rules every single time.

Stretching

You could just stretch the whole syllable, but if you overdo it, you’ll stumble onto one big issue: Consonants don’t stretch well.

If it’s a soft consonant, it will stretch alright but throw off the timing of the vowel sound. If it’s a hard consonant, stretching it too much will have undesirable effects: a “t” sound, for example, can start to sound like “d”. There are situations where you could exploit this, but that’s off-topic here.

One way to deal with that is by making a cut between the consonant and vowel part, then stretching only the latter.

Then extend either item so that you have a bit of overlap.

If you don’t do that, you’ll hear the gap between the two items.

If there’s a previous syllable, then make sure that no syllable is ever completely covered, or it will mess with the automatic crossfade.
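
Here is a rough sketch of the arithmetic behind “stretch only the vowel”. It assumes you know the length of the consonant part, the length of the vowel part, and the length of the slot the syllable has to fill. The function name and all the values are invented for the example.

```python
def vowel_stretch_factor(consonant_len: float, vowel_len: float, slot_len: float) -> float:
    """Factor by which to lengthen the vowel so that consonant + vowel fills the slot.

    The consonant is left untouched; only the vowel absorbs the change.
    """
    remaining = slot_len - consonant_len
    if remaining <= 0:
        raise ValueError("the consonant alone is longer than the slot")
    return remaining / vowel_len


# Made-up lengths in seconds, as you might read them off the waveform.
consonant, vowel = 0.05, 0.15
slot = 0.5  # e.g. one beat at 120 BPM

print(vowel_stretch_factor(consonant, vowel, slot))  # stretching the vowel only
print(slot / (consonant + vowel))                    # stretching the whole syllable instead
```

With these numbers, the vowel alone gets stretched to about 3 times its length, while stretching the whole syllable would use a factor of about 2.5 and drag the consonant along with it, which is exactly what we are trying to avoid.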

Having to manage all that can be a bit of a pain, so if you’re using Reaper, you could learn how to use its Stretch Markers instead.

Sometimes you won’t be able to stretch the vowel part because it’s silenced. This can happen when syllables like し (shi), す (su), つ (tsu), etc. are at the end of certain words. In that case, you can still keep the consonant part as the main rhythmic element.

Let’s look back at the previous example, with the timings now readjusted for soft consonants.

Dealing with gaps

You may have some gaps between your items. How you deal with them can vary depending on the situation.

You could stretch the items to entirely fill the gaps, in which case, you’ll have to be careful about consonants as seen previously.
You could add some repeating syllables or stuttering.
You could preserve, or even embrace, the gaps to make the next syllable more impactful. Here’s an example (see the gap in “Ara-i san”; this doesn’t really apply to “Fenne-c” because the gap is part of the regular pronunciation for フェネック).

About the third point: you can also introduce tiny gaps at the end of your syllables to make the next one easier to hear. This can be used on fast-paced sentence mixing, and will make the whole thing sound pluckier with added compression.
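
As an illustration of those tiny gaps, the sketch below shortens each syllable so that it ends a little before the next one starts. The function name, the 20 ms gap and the syllable positions are all assumptions picked for the example, not recommended values.

```python
def trim_for_gaps(starts, end, gap):
    """Return (start, length) pairs, each syllable ending `gap` seconds before the next one.

    `starts` are the syllable start times in seconds (sorted),
    `end` is where the phrase ends.
    """
    boundaries = list(starts[1:]) + [end]
    trimmed = []
    for start, next_start in zip(starts, boundaries):
        length = next_start - start - gap
        if length <= 0:
            raise ValueError("gap is longer than the syllable itself")
        trimmed.append((start, length))
    return trimmed


# Four syllables, a quarter of a beat each, at an example tempo of 120 BPM.
starts = [0.0, 0.125, 0.25, 0.375]
for start, length in trim_for_gaps(starts, end=0.5, gap=0.02):
    print(start, length)
```

The start times stay on the grid; only the item ends move, so the rhythm itself is not affected.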

In summary

Cut by syllables.
Make sure soft consonants aren’t throwing off your rhythm.
If you stretch, stretch the vowel part.
Know when to fill gaps and when to keep them.

Here’s an example of working with these concepts in mind, by Kolina.

Styles of sentence mixing

From this point on, we’ll talk style over technique. This part of the article is more subjective, since I have to label vague ideas about sentence mixing.

Knowing the different ways you can do sentence mixing, and what effects each of them can achieve, will help you to get a sense of what style best fits your needs.

The way sentencing is done can vary from person to person, video to video, and even scene to scene. By listening attentively, you’ll learn to pick up on recurring characteristics in many videos.

Here are some makeshift categories that I’ve found useful to slap a label on.

Meaning-focused

Defined by the commitment to making complete, structured sentences.

There’s a feeling of continuity.
The rhythm has some amount of variation, but is not the most important element.
Some rhythmic sacrifices can be made to fit the desired sentence in.
Can be the BGM’s lyrics, the source material’s dialog, or made-up dialog that tells a brand new story.

Here are some examples:

The sentence integrity is preserved, and the rhythm stays relatively consistent. At 0:34, the rhythm from the original lyrics is completely ignored in order to deliver the source’s dialog more smoothly.

At 00:48, we’ll see that most syllables are the same length (1/4th of a beat). In this case, the variations are dictated by the dialog itself: syllables are being elongated (― = ~) and put together (like the “ni n”).

Here, meaning takes precedence over rhythm: each syllable gets roughly the same amount of time, and some sentences end up being longer than others.

The sentencing in this one has a less consistent rhythm, with syllable length varying wildly because of the anime’s frequent tirades and the song’s fast tempo. Depending on the song, this pacing may sound good. Otherwise, you’ll need to consider eliminating some sentences to get more space.

Looking at 00:32, the shortest syllables are still 1/4th of a beat. The gaps and the composed syllables (“sen”, “sei”, “ren”, “ai”) help make the pacing easier to digest. They give the listener time to catch up on the meaning.

Additional examples

魔女っぽいな (spoiler warning) by GainA
サーバント×ウガルル by Sakurei2015
塵紙人間 by ケツからスパゲッテ
ピクニック日野茜 by 芋タルト
Shinonome DISCO by beat_shobon

Catchphrase-focused

They say that limitation breeds creativity, and I think Otomads are no exception. You’ve probably seen a few videos where the author gets more mileage than you’d expect from just a few seconds of footage. The simplicity of that kind of sentencing makes it a lot easier to remember, so those Otomads can often become earworms.

Multi-source becomes easier to do as you don’t have to worry about continuity.
There’s more repetition to make the most of your source. Parts that sound worse can be discarded since the sentences don’t need to make sense.
Works better with instrumentals:
- It can follow the main melody or the percussion’s rhythm.
- It can be pitched like a lead, and even entirely replace it.
- It can do both at the same time with two different tracks.

Examples

havent uploaded in a while huh cuties? by Quality Control
ひなたのツッコミ by 豊臣秀吉の埋蔵金
- Precisely follows the lead and drums (00:42) of the music.
ダーリンがきたにょ by お皿タウンさ
- Fast-paced syllables, almost ignoring the original song.
リンゴの音MAD by Sakurei2015
- Constantly switches between sources relating to fruits.
テイク2 by 不覚暁
- Follows the rhythm of the lead but focuses more on arranging the sentence in every possible order than on the lead’s musical aspects.

Rhythm-focused

What happens when even words become optional? You get to the logical opposite of meaning-focused sentencing, where the rhythm dominates every decision.

A lot of EDM artists regularly do that, like PSYQUI or t+pazolite.

Can be heavily driven by the lead and percussions.
- May even surpass the lead and add more to it.
Makes use of everything: words, yells, gasps, impacts, explosions, sfx, etc (I believe that any sound can become “sentence mixing” if it is contextually treated as such by the author).
Harder to do multi-source with:
- The listener can feel overloaded when sources switch every syllable.
- Finding sources that flow well together gets harder the more they switch around.

Examples

ENERGY 酔拳 MATRIX by Sakurei2015
- 00:13 start of the voice mixing
- 00:25 an extra layer of sentencing is added to the previous one, driven by the BGM
- 00:49 panned backing tracks
恋愛サーキュレーションでスーパーボンバーマンR モリモリスター by Sakurei2015
- Syllables are covering the lead melody.
- The sentencing has no discernible meaning; it’s made to sound nice.

Chaos

A powerful love letter to Otomads that makes for a great collab finale or a short, supercharged part.

Switches sources non-stop.
- Despite that, can be relatively easier to keep up with if you know the sources.
  - Then again, it can be pure chaos even for regular Otomad fans.
Can be an all-stars video or use many different scenes from a single source.
Will get you comments that say “drugs”.

Examples

The haunted dance part in practically every single medium-scale collaboration
BEAT-NICONICO-WORLD by namacream
ヤバい合作 ED by 阿保草
- Credit part featuring sources from all previous parts, goes at a fast but regular pace (1 source per beat)
  - Because each scene has a varying number of syllables, this video is a good case study of “how can I fit X syllables in a set timeframe”
- The lead sample also changes every measure, adding to the chaos
音MAD合作晒しイベント OP by Sakurei2015 and Y.むるか.S
CAMELLO MULTIVERSE by camel ytpmv
M2 - Ultra Trailer by namacream
最終鬼畜オールスター
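
On the “how can I fit X syllables in a set timeframe” question mentioned for ヤバい合作 ED, the simplest possible answer is to split the slot evenly. The sketch below does just that; it is an assumption for illustration, not a description of how that video was actually made, and the function name and the 150 BPM tempo are made up.

```python
def even_slots(n_syllables: int, start: float, duration: float) -> list:
    """Split `duration` seconds starting at `start` into n equal (start, length) slots."""
    if n_syllables < 1:
        raise ValueError("need at least one syllable")
    length = duration / n_syllables
    return [(start + i * length, length) for i in range(n_syllables)]


beat = 60.0 / 150  # one beat at an example tempo of 150 BPM (assumption)

# The same one-beat slot, filled with scenes of 2, 3 and 5 syllables.
for n in (2, 3, 5):
    print(n, even_slots(n, 0.0, beat))
```

Even splits are only a starting point. As seen earlier, you will often get a better result by giving more time to elongated syllables and by leaving gaps where they help.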

Mixing styles together

It’s common for more than one style to be used at once. Here’s some general advice when you do this.

Aim for 1 meaning-focused track at most.
- Two or more legible sentences played at the same time will immediately turn unintelligible.
You don’t need that many tracks at once for an audio that sounds full.
- While sentencing can be a bunch of syllables without meaning, the sound they make is still more complex than regular samples, so you need fewer tracks for “complete-sounding sentencing” than for a “complete-sounding no-BGM”.
Unless you’re aiming for an “in your face” kind of feeling, you should introduce/switch the sentencing tracks one or two at a time.
- Adding complexity over time is a great way to make a long (> 1 min 30s) and repetitive BGM feel less repetitive.

Here are a few examples:

All the tracks in this video use sentence mixing: Although pitched, it’s all syllables and no samples.
It builds up over time: no more than 2 new tracks are added or switched at a time.
The lyrics are mixed in, but don’t have any other purpose than sounding good.
- This is a case of full sentences that are NOT focused on meaning.
Here is a complete breakdown of the audio at 01:01.
- 4 tracks of sentence mixing are all it took to make the audio borderline chaotic.

01:31: Zetsubou stacks the sentence mixing part from the very beginning of the video with the one at 00:31.
- In this case, 2 is the limit. With another track, the result would become too chaotic. This is because the sentence mixing for both tracks is more complex.

01:39: A few tracks are playing at once, most of them would sound empty if isolated (like the bottom right one):
- 4 measures (8 seconds) later, the “yabai” track gets switched for the “yuuki” one.
- 2 measures later, the “oya wa henna…” track is added. Since the previous “yuuki” track wouldn’t conflict too much, it doesn’t get removed; it just gets its volume lowered.
- Every 4 measures, the “yo” yells help make the transition smoother.

Starts with a focus on meaning.
At 00:54, switches to fast-paced catchphrase-focused sentence mixing.
- Despite not being focused on meaning, there is still a bit of continuity between some of the elements (ex. Aocchi -> Yunocchi) (Celeri tabeteru -> Usagi ga nigeteru).

Where we blur the line

The last thing you’d want to do after reading this article is to think that this is all there is to know about sentence mixing, or that you must adhere to only one of those styles. Hopefully the later sections have done a good enough job of showing that most authors switch a lot between styles and put their own twist on them, and that’s what you should do too!

Think of the styles I presented as a rough outline to get started when brainstorming for a video.

If you want to make a touching video on an anime you love, you’ll want to showcase scenes from across the whole season and include emotional dialogue. Because meaning would be so important, you might want to aim for meaning-focused sentencing.
Doing a Kenshin video? Catchphrase-focused sentencing would work very well with the many dubs the source has.
Doing a haunted dance video? Catchphrase-focused and utter chaos are both very good fits…

…But what if those instructions were the very reason you’d want to go with a different style? What would a story-driven Haunted Dance Otomad sound like? Those are the thoughts I’d like you to keep in mind.

So go crazy! Understand what makes the sentencing in the Otomads you love what it is, learn to identify the different techniques and patterns, and then do the wild things no one’s ever seen!

With all that said, does it mean that an audio needs to have sentence mixing in order to tell a story? Fuck no.

Conclusion

In this article, we first revisited some of the basic item-editing tricks that are needed for sentence mixing, as well as the mistakes that can make your timing sound “off”.

Then, we took a broader and more opinionated look at some of the styles I was able to identify and label over time.

I don’t think the diagrams I showed in this article will be of any use as a blueprint. Rather, they illustrate what I believe were some of the concessions the authors had to make when balancing rhythm and meaning.

That should be about it, thanks for reading thus far, and I hope this has been a helpful read.
