@David, thanks for the video. I like the silhouette/fades graphic, but I also experienced divided attention. Re @Steve on proximity: yes, captions and images are a good non-example of spatial contiguity (Mayer), although admittedly these words are not THAT far away from the graphic.
For Step one:
"Hands and arms swing forward and upward." I read this, looked at the graphic, reread the text, looked again.
"Head raises and trunk extends." Same deal.
@David, I like the rollover idea. And in this example, maybe a 2-step rollover, one step for each of the 2 "actions" as they take place.
My 2 cents. Good exercise!