Gaze interaction is a promising input modality for Virtual Reality (VR) because it is intuitive and efficient. Yet the depth control inherent in our visual system remains underutilized in current methods. In this study, we introduce FocusFlow, a hands-free interaction method that capitalizes on human visual depth perception within 3D VR scenes. We first develop a binocular visual depth detection algorithm to understand eye input characteristics. We then propose a layer-based user interface and introduce the concept of a "Virtual Window", which offers intuitive and robust gaze-depth VR interaction despite the limited accuracy and precision of visual depth estimation, particularly at farther distances. We also design a learning procedure that uses different stages of visual cues to guide novice users in mastering depth control. Our user studies with 24 participants demonstrate the usability of the proposed Virtual Window concept as a gaze-depth interaction method. In addition, our findings reveal that the user experience can be enhanced through an effective learning process, even with weak visual cues, which helps users develop muscle memory for this new input mechanism. We conclude by discussing strategies to optimize learning and potential research topics in gaze-depth interaction.