The following paragraphs will guide you through our journey of the knowns and unknowns within the AR world — starting with one of our engineer’s dreams and resulting in a prototype that’s literally at our fingertips.
Where Is the Finger?
Our first task was to find the location of the index finger's tip. With some experience in computer vision, our starting point was more or less straightforward: use the Vision framework to detect the finger in the video feed provided to us by the system, and then use its position as the steering wheel on our canvas. You can take a small detour to learn more about how Vision works and how simple hand detection is done in the WWDC video Detect Body and Hand Pose with Vision.
Although it was simple to start detecting the finger, we soon hit some unknowns.
Firstly, the raw image data provided to us with each frame was oriented in landscape mode even though we were holding the camera in portrait. With this in mind, we created a VNImageRequestHandler instance, passing the CGImagePropertyOrientation.right orientation so that the image data would be treated as rotated 90° clockwise from the image's intended display orientation.
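In code, the per-frame request looked roughly like the following sketch. It is our own simplified helper, assuming pixelBuffer is the camera frame handed to us by the system; everything else is the standard Vision API.

import Vision
import CoreVideo

// A minimal sketch (not the full production code) that runs hand-pose detection
// on one camera frame and returns the fingertip position normalized to 0...1
// with a lower-left origin.
func detectIndexTip(in pixelBuffer: CVPixelBuffer) throws -> CGPoint? {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 1

    let handler = VNImageRequestHandler(
        cvPixelBuffer: pixelBuffer,
        orientation: .right, // raw frames are landscape; tell Vision how they are rotated
        options: [:]
    )
    try handler.perform([request])

    guard let hand = request.results?.first else { return nil }
    return try hand.recognizedPoint(.indexTip).location
}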
Secondly, the coordinate space in which Vision operates is normalized from 0.0 to 1.0 with a lower-left origin, and these coordinates are relative to the image data provided to the VNImageRequestHandler(cvPixelBuffer:orientation:) instance. However, the screen coordinate space has an upper-left origin, so to convert the result back to screen space, we needed to do some simple math.
let locationInScreenSpace = CGPoint(
    x: indexTip.x,
    y: 1 - indexTip.y // flip the y-axis: Vision uses a lower-left origin, the screen an upper-left one
)
Lastly, even with the correct conversion, something was still wrong with the results we were receiving. For some reason, the detected position appeared to be correct only in the center areas of the screen; as we moved the finger away to the left or right, the precision decreased.
You may have already guessed why this happened. The reason was simple: we were trying to convert a normalized point from one coordinate space to another without realizing that the two might have different aspect ratios. In other words, the image data we provided to Vision for processing had a different aspect ratio (4:3 on iPhone 13 Pro) compared to our device screen (19.5:9 on iPhone 13 Pro). And since the results were relative to the original image, we needed to look at them in terms of where they would appear on the screen.
An illustration of the problem:
Following this realization, it was a matter of math to make the correct conversion. We found the tip of the index finger and were able to see the results at the correct positions!
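Roughly, the correction looked like the sketch below, which is our own simplified version. It assumes the camera image is shown with aspect-fill behavior (the 4:3 frame is scaled up and cropped to fill the screen) and that imageSize is the camera resolution in the same portrait orientation as the screen; the function and parameter names are illustrative.

import CoreGraphics

// Convert a Vision point (normalized, lower-left origin, relative to the camera image)
// into screen coordinates, accounting for the different aspect ratios.
func screenPoint(from visionPoint: CGPoint, imageSize: CGSize, viewSize: CGSize) -> CGPoint {
    // Flip the y-axis first: Vision has a lower-left origin, the screen an upper-left one.
    let normalized = CGPoint(x: visionPoint.x, y: 1 - visionPoint.y)

    // Scale the image so it fills the view, then subtract the cropped overflow.
    let scale = max(viewSize.width / imageSize.width, viewSize.height / imageSize.height)
    let scaledSize = CGSize(width: imageSize.width * scale, height: imageSize.height * scale)
    let xOffset = (scaledSize.width - viewSize.width) / 2
    let yOffset = (scaledSize.height - viewSize.height) / 2

    return CGPoint(
        x: normalized.x * scaledSize.width - xOffset,
        y: normalized.y * scaledSize.height - yOffset
    )
}

For completeness, ARKit's ARFrame.displayTransform(for:viewportSize:) offers an equivalent conversion from normalized image coordinates to view coordinates.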
Before & after math correction:
The Higher Dimension
Seeing the Depths of the World
To access the index finger's real position, we needed to move from the 2D world to 3D space. Moving to another dimension is never an easy task but, luckily, we had a LiDAR scanner with us; without it, this would've been nearly impossible.
Thanks to LiDAR technology, we could see the world from a different perspective, visualizing the depth of everything around us. Whenever the finger moved, we also needed to use the depth map to check how far away things were from us. And since the depth map is a plain 2D array of data, our previously gained experience in finding the index finger found its purpose — we used the finger's position in the 2D image to obtain its location in the depth map.
Note that, for some reason, the first index does not start at the top-left corner (as we may be used to) but at the bottom-right corner.
import CoreVideo

extension CVPixelBuffer {
    // Reads the depth value (in meters) at the given column and row of a 32-bit float depth map.
    func value(column: Int, row: Int) -> Float? {
        guard CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32 else {
            return nil
        }
        CVPixelBufferLockBaseAddress(self, .readOnly)
        defer { CVPixelBufferUnlockBaseAddress(self, .readOnly) }
        guard let baseAddress = CVPixelBufferGetBaseAddress(self) else {
            return nil
        }
        // Rows can be padded, so use the buffer's bytes per row rather than width * stride.
        let bytesPerRow = CVPixelBufferGetBytesPerRow(self)
        let offset = row * bytesPerRow + column * MemoryLayout<Float>.stride
        return baseAddress.load(fromByteOffset: offset, as: Float.self)
    }
}
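As a hedged illustration of how the pieces fit together (assuming the ARSession is configured with the .sceneDepth frame semantics, and glossing over the index-orientation quirk mentioned above), the lookup could be sketched as:

// Illustrative only: look up roughly how far the fingertip is from the camera.
if let depthMap = sceneView.session.currentFrame?.sceneDepth?.depthMap {
    let width = CVPixelBufferGetWidth(depthMap)
    let height = CVPixelBufferGetHeight(depthMap)

    // Map the normalized Vision point to depth-map indices
    // (flipping or transposing as needed to match the depth map's orientation).
    let column = Int(indexTip.x * CGFloat(width - 1))
    let row = Int((1 - indexTip.y) * CGFloat(height - 1))

    let distanceInMeters = depthMap.value(column: column, row: row)
}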
Moving to the 3D World
This part of the journey was particularly tricky due to our lack of expertise in 3D computer graphics. We were aware of what needed to be done — take the position of the index finger on the screen with its depth and convert it to the real position in the world space — but, naturally, we did not know where or how to start.
After many failed attempts, we bumped into rickster's thorough answer on Stack Overflow. Then, with a few lines of code, we were able to move from a simple 2D screen coordinate to the real 3D position of the index finger.
func unprojectPoint(_ point: CGPoint, z: Float) -> SCNVector3 {
    // Project a reference point at the desired depth to find out
    // which screen-space z value corresponds to it.
    let projectedOrigin = sceneView.projectPoint(SCNVector3(0, 0, z))

    // Combine the 2D screen position with that z and unproject it back into world space.
    let viewPointWithZ = SCNVector3(Float(point.x), Float(point.y), projectedOrigin.z)
    return sceneView.unprojectPoint(viewPointWithZ)
}
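Putting it together with the previous steps, the call then looks something like this hypothetical usage (the variable names are ours, not from the production code):

// `fingertipOnScreen` is the corrected screen-space point of the index fingertip,
// and `fingertipWorldZ` is the z value derived from its LiDAR depth.
let fingertipInWorld = unprojectPoint(fingertipOnScreen, z: fingertipWorldZ)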
Drawing in the Air
Finally, we jumped into the last part of the journey, which involved a great deal of experimentation with custom geometries and math.
Since SceneKit only provides geometries for primitive shapes, such as a cube, a cylinder, a sphere, etc., we needed to create our own custom shape that would represent a single brush stroke. Drawing with the predefined geometries wouldn't work, since we'd be adding a new node for each frame. At 30 to 60 FPS, that would make the application extremely inefficient and painfully slow.
So, on the mission to create a shape representing a brush stroke, we met some interesting creatures worth mentioning. We encountered many SCNVector3 values in multiple places in SceneKit but soon realized their lack of fundamental capabilities — like mathematical operations. It was, however, not their fault; they were not designed to do the math but, rather, to represent data, such as node or vertex positions.
As it turned out, there was another, similar type called SIMD3 — specifically, SIMD3<Float>, sometimes referred to as simd_float3 — that was very knowledgeable in mathematics. With its help, we could define vertices that would eventually connect to create the brush stroke.
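To give an idea of what that looks like, here is a minimal sketch (not the exact production code) of turning a list of SIMD3<Float> vertices into a custom SCNGeometry for a single stroke segment:

import SceneKit

// A minimal sketch, assuming `vertices` holds the triangle-strip points of one brush stroke.
func strokeGeometry(from vertices: [SIMD3<Float>]) -> SCNGeometry {
    // SceneKit expects SCNVector3 for vertex data, so convert from SIMD3<Float>.
    let vertexSource = SCNGeometrySource(
        vertices: vertices.map { SCNVector3($0.x, $0.y, $0.z) }
    )

    // Connect consecutive vertices into triangles using a simple triangle strip.
    let indices = Array(0..<UInt32(vertices.count))
    let element = SCNGeometryElement(indices: indices, primitiveType: .triangleStrip)

    return SCNGeometry(sources: [vertexSource], elements: [element])
}

The point of a custom geometry like this is that a stroke can grow by rebuilding its vertex data, rather than by spawning a new node every frame.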
What kind of brush stroke would it be if it couldn't be curvy and turn in whatever direction our finger was going? Since remembering all of analytic geometry and linear algebra wasn't going to happen, we were pleased to meet quaternions, in the form of simd_quatf. It was because of their set of capabilities that we didn't have to calculate the brush stroke's angle of rotation using vector dot products, cross products, matrices and whatnot.
For instance, to obtain the rotation between two directions, we'd simply call simd_quaternion(_:_:). Then, when we wanted to use the rotation to rotate a vertex, we asked the quaternion to "act" on it with the act(_:) method.
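A small example of what this looks like in code (the directions and the vertex below are made-up values, not taken from the app):

import simd

// Two consecutive stroke directions (illustrative values).
let previousDirection = simd_normalize(SIMD3<Float>(0, 0, -1))
let currentDirection = simd_normalize(SIMD3<Float>(0.3, 0.1, -1))

// Quaternion that rotates `previousDirection` onto `currentDirection`.
let rotation = simd_quaternion(previousDirection, currentDirection)

// Rotate a vertex of the stroke's cross-section by letting the quaternion "act" on it.
let vertex = SIMD3<Float>(0.01, 0, 0)
let rotatedVertex = rotation.act(vertex)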
We highly encourage reading the Working with Quaternions article to learn more.
An example of drawing in the air:
Conclusion
Despite having little to no experience with a journey like this — one that was stressful, exciting and very challenging, yet rewarding and enlightening — we managed to create a prototype that fulfilled our initial dream, in which the finger is your brush and the world is your canvas.
Would you like to join the STRV team? We're hiring!