In linguistics this is called a coarticulatory effect, and it's caused by needing to move the articulators between two positions rapidly. As such, it can be thought of as a kind of "hardware" limitation of humans, as opposed to a "software" limitation of any single language. Whether it shows up in another language depends mainly on whether that language puts the same sounds in sequence.
The "ch" affricate (which is t͡ʃ in the IPA) is made up of a voiceless alveolar stop component followed by a post-alveolar fricative component. Because "y" is palatal, moving from the alveolar stop toward it takes the tongue through the post-alveolar region, so you end up getting that fricative component through coarticulation.
Edit: here's an explanation without the jargon:
- "t" in English is produced by your tongue contacting the ridge behind your top teeth.
- "y" in this context is produced with the tongue sitting near the palate (significantly behind the ridge used for "t").
- The English "ch" sound is actually a mix of two sounds: "t" and "sh", in rapid succession.
- The "sh" sound is produced between the places where "t" and "y" are produced.
So, if you have a "t" and a "y" in quick succession, your tongue has to move quickly between a couple different spots -- and crucially, through the spot which produces "sh":
"t" -> "sh" -> "y"
And because "t" + "sh" equals "ch", you get the "ch" when producing this sequence. You can of course articulate things carefully and not produce it -- but in common, quick speech, that's why it shows up. Singing isn't different from speech in this regard.
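If it helps to see the same reasoning as a toy model, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than a real phonetics library: the three-place ordering, the one-sound-per-place table, and the function names `tongue_path` and `coalesce` are made up for this example.

```python
# Places of articulation, ordered front to back along the roof of the mouth.
# This three-step ordering is a deliberate simplification (an assumption).
PLACES = ["alveolar", "post-alveolar", "palatal"]

# One representative sound per place, matching the explanation above.
SOUND_AT = {"alveolar": "t", "post-alveolar": "sh", "palatal": "y"}
PLACE_OF = {sound: place for place, sound in SOUND_AT.items()}


def tongue_path(start, end):
    """List the sounds whose places the tongue passes on the way from start to end."""
    i = PLACES.index(PLACE_OF[start])
    j = PLACES.index(PLACE_OF[end])
    step = 1 if j >= i else -1
    return [SOUND_AT[PLACES[k]] for k in range(i, j + step, step)]


def coalesce(path):
    """Merge an initial "t" + "sh" into "ch", as happens in quick speech."""
    if path[:2] == ["t", "sh"]:
        return ["ch"] + path[2:]
    return path


path = tongue_path("t", "y")
print(path)            # ['t', 'sh', 'y']
print(coalesce(path))  # ['ch', 'y']
```

Careful articulation corresponds to skipping the `coalesce` step; the model says nothing about timing or voicing, only about which spots the tongue passes through.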