ARKit allows you to create and play with augmented realities, creating a new way for users to interact with digital content and the world around them. But what if you could have your model not only augment reality but also interact with and react to the changes around itself?

Image showing a wireframe cube around a potted plant

Placing a cube wireframe around the plant was our end result

This post is part of a multi-part series on ARKit, where we talk about designing for AR, building a demo app, and exploring and testing many of the features of ARKit. We previously wrote on Adding the finishing touches to 3D models in Xcode.

In the previous post in this series we played around with AR and added a model to a scene. While doing so we wondered if there was a way to attach a model to a specific object in the world. So we turned to Machine Learning to see if, by recognizing the scene, we could place a 3D model seamlessly into an environment. Furthermore, we wanted to create a recipe for adding ML capabilities to an app that doesn't require specialized knowledge of image recognition, Neural Networks, or Machine Learning in general.

We experimented with CoreML using a converted TensorFlow model, as well as with the YOLO framework. Our goal was to find the most clearly identified object within the scene and automatically add a 3D model to it. Read on to find out about our experiences with the different approaches we tried, and each one's comparative advantages.

For an easy-to-read introduction to Machine Learning, check out this post by Alex Styl on understanding machine learning.

Exploring our options

We decided to work with pre-trained Machine Learning models and the 3D models created in the first part of this series, Designing for ARKit.

The focus of our investigation for this demo was not training a machine learning model, but rather seeing how easily you can integrate machine learning into an AR app and how straightforward CoreML is to use on iOS.

Core ML delivers blazingly fast performance with easy integration of machine learning models enabling you to build apps with intelligent new features using just a few lines of code.[1]

With Core ML, you can integrate trained machine learning models into your app. Apple Documentation

Another important distinction before we move on is the difference between image recognition and object detection. Image recognition is a machine learning term for when a model is trained to identify what an image contains:

Image recognition is the process of identifying and detecting an object or a feature in a digital image or video.


Image recognition:

Image detection app from the CoreML udacity course

Object detection on the other hand is the process of a trained model detecting where certain objects are located in the image.

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos.


Object detection:

This is what the output of the TinyYOLO CoreML model by Matthijs Hollemans looks like

All the pre-trained models Apple gives us for CoreML are built for image identification rather than object detection, so we knew that we had to convert an object detection model to CoreML. [2]
For this exercise we also used Apple's Vision framework, since it processes images and prepares input from the ARSCNView for the model.

Apply high-performance image analysis and computer vision techniques to identify faces, detect features, and classify scenes in images and video.

We decided to use CoreML because its implementation is straightforward once the model has been imported into the app. A CoreML model reads as a regular Swift class, which means it can be used by mobile developers without having to worry about machine learning concepts. That makes it a really powerful framework. The caveat is that your trained model might end up being a large file that users have to download with the app (using disk space on the phone), but since the model lives on the phone, the machine learning functionality runs on the device instead of in the cloud. Doing so allows object detection to happen even without an internet connection.
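As a taste of what "reads as a regular Swift class" means, here is a minimal sketch of calling an imported model. It assumes a TinyYOLO.mlmodel file has been added to the Xcode project (Xcode generates the TinyYOLO class automatically); the image and grid property names are assumptions that depend on the model's input/output definitions:

```swift
import CoreML

// Sketch: Xcode generates a Swift class for every imported .mlmodel,
// so getting a prediction is a plain method call — no ML expertise
// needed. `TinyYOLO` and its `image`/`grid` names are assumptions.
func predictGrid(from pixelBuffer: CVPixelBuffer) -> MLMultiArray? {
    guard let model = try? TinyYOLO(configuration: MLModelConfiguration()) else {
        return nil
    }
    // The output here would be the raw tensor the model produces.
    return try? model.prediction(image: pixelBuffer).grid
}
```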

During WWDC 2018 Apple announced that their Vision framework could now recognize and detect objects on a live camera capture, making it perfect for this tutorial without the need for external machine learning models or conversions. Because of that we updated the code that goes with this blog to use Vision to test how it worked, but we also left the existing code and documentation for how to use a converted TensorFlow model called Tiny YOLO.

You can check out the new Vision model working on our GitHub. For both machine learning models, the 3D model used and the way it is placed on the AR scene are the same.
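As a sketch of what the Vision route looks like, here is one way to handle the results of a Vision object detection request (iOS 12+). The completion handler wiring, the confidence threshold, and treating "most prominent" as "highest confidence" are our assumptions; the actual code in the repo may differ:

```swift
import Vision

// Sketch: with Vision's object detection support, results arrive as
// VNRecognizedObjectObservation values — each carries a normalized
// bounding box plus classification labels, so no manual YOLO decoding
// is needed.
func visionRequestDidComplete(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNRecognizedObjectObservation] else {
        return
    }
    // Take the most confident detection as the most prominent object.
    guard let best = observations.max(by: { $0.confidence < $1.confidence }),
          best.confidence > 0.5 else { return }
    let label = best.labels.first?.identifier ?? "unknown"
    // boundingBox is in normalized coordinates (0...1).
    print("Detected \(label) at \(best.boundingBox)")
}
```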

Using a Tiny YOLO model

Apple provides different ways to convert a model to CoreML. Options include a Python package called Core ML Tools, maintained by Apple itself, which converts models from several Python frameworks to CoreML; Apache MXNet, maintained by Apache, which can train MXNet models and then convert them to CoreML; and the recently added TensorFlow converter, maintained by Google, which allows us to convert some of the best-known TensorFlow models into usable CoreML.

We looked into converting a TensorFlow model into CoreML using the converter, but found that the graph for the object detection model was not optimized to be understood by the converter. While considering different object detection models we found this article by Matthijs Hollemans in which he uses the YOLO framework, converts it to CoreML, and implements it in an iOS app.[3]

YOLO is a neural network made up of 9 convolutional layers and 6 max-pooling layers, but its last layer does not output a probability distribution like a classifier would. Instead the last layer produces a 13×13×125 tensor.
This output tensor describes a grid of 13×13 cells. Each cell predicts 5 bounding boxes (where each bounding box is described by 25 numbers). We then use non-maximum suppression to find the best bounding boxes.

The YOLO model converted to CoreML takes an image as input and outputs the 13×13 grid of cells describing the objects it detects. We then need to convert this grid into CGPoint coordinates for our scene. In his post Matthijs goes over how to convert these 25 numbers into the actual Float values needed for the prediction. Read the post and take a look at his code if you want to understand YOLO and CoreML a little better.
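As a sketch of what that decoding involves (following the scheme Matthijs describes; his actual code differs), the center of a predicted box comes from squashing the raw network outputs through a sigmoid and offsetting by the grid cell's position. The constants assume the 416×416 input and the 13×13 grid, giving 32 points per cell:

```swift
import Foundation

// Sketch of decoding one YOLO grid cell: the raw tx/ty outputs are
// squashed to 0...1 and offset by the cell's column/row, then scaled
// by the cell size to land in 416×416 image coordinates.
func sigmoid(_ x: Float) -> Float { 1 / (1 + exp(-x)) }

func boxCenter(col: Int, row: Int, tx: Float, ty: Float) -> (x: Float, y: Float) {
    let cellSize: Float = 416 / 13   // 32 points per grid cell
    let x = (Float(col) + sigmoid(tx)) * cellSize
    let y = (Float(row) + sigmoid(ty)) * cellSize
    return (x, y)
}
```

For example, a box in the center cell (column 6, row 6) with raw offsets of zero decodes to the middle of the image, since sigmoid(0) = 0.5.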

We used his code, with his permission, to predict where our object is located. There are a few differences between our implementation and his, notably that we used an ARScene as the input image, and that in our case the output is not an array of bounding boxes but a single box marking the most prominent object in the scene.

Our first step was importing the TinyYOLO ML model into the app and setting up the TinyYolo class, which converts the grid into a Prediction structure. Once we had this set up we started editing the View Controller code from the ARKit app. Previously the app used tap events to decide where to place the 3D object; now we want to place the object at the point given by the machine learning model.

The first thing we need to do is pass the current AR scene view to the model so it can analyze it and find the most prominent object on the screen.

let pixelBuffer = sceneView.session.currentFrame?.capturedImage

However, if we pass that pixelBuffer straight into the YOLO model it will return an error, not a prediction. This happens because the model expects an image of a specific size as input, meaning we need to scale the image to fit the model's input (416×416 in this model).

To get the input processed we could manually scale the image to 416×416, or we could wrap the model in a Vision request. Vision, as mentioned before, processes images for machine learning models: it will scale our input image, pass it through the CoreML model, and return the prediction results.

guard let visionModel = try? VNCoreMLModel(for: yolo.model.model) else {
    print("Error: could not create Vision model")
    return
}
let request = VNCoreMLRequest(model: visionModel, completionHandler: visionRequestDidComplete)

We set up a Vision model as a wrapper of our CoreML model and create a Vision request at the start of the app. When we want to perform the request, we create a VNImageRequestHandler with the pixelBuffer we want to use and perform the request.

let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)
try? handler.perform([request])

Before we can perform the request there are still a couple of things we need to set up. One is to perform the request on a background thread so it does not block the UI of the app; the second is to make the dispatch semaphore wait. We initialized the semaphore with a value of 2 to avoid possible repeated calls to the model before we get the results.

let semaphore = DispatchSemaphore(value: 2)

You will need to create the dispatch semaphore at the beginning of the app and then set it to wait before performing the first request and before going into the background thread.
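The throttling pattern described above can be sketched in isolation like this. The queue names and the simulated "request" body are placeholders, not the app's actual code; the point is that a semaphore initialized to 2 caps how many prediction requests can be in flight at once:

```swift
import Dispatch

// Sketch: a semaphore with an initial value of 2 blocks the caller
// once 2 requests are already running, so new frames wait instead of
// piling up on the model.
let gate = DispatchSemaphore(value: 2)
let group = DispatchGroup()
let counter = DispatchQueue(label: "counter")  // serializes access to `completed`
var completed = 0

for _ in 0..<4 {
    gate.wait()                               // blocks while 2 requests are in flight
    DispatchQueue.global().async(group: group) {
        // ... perform the Vision request here ...
        counter.sync { completed += 1 }
        gate.signal()                         // free the slot once results are in
    }
}
group.wait()
print(completed)  // prints 4
```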

Once the model has finished processing the image we get the resulting prediction in visionRequestDidComplete; we can then read the results as a multi-array value and pass that to the Yolo class to get the outlining CGRect boxes of the objects detected.

if let observations = request.results as? [VNCoreMLFeatureValueObservation],
   let features = observations.first?.featureValue.multiArrayValue {

    let boundingBoxes = yolo.computeBoundingBoxes(features: features)
}

computeBoundingBoxes is a method created by Matthijs Hollemans that accompanies the TinyYolo model. He explains the transformation on his blog.
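Since our demo keeps only a single box for the most prominent object, the step after computeBoundingBoxes can be sketched as picking the prediction with the highest confidence score. The Prediction fields below are assumptions modeled on Hollemans' struct, not the exact definition from his code:

```swift
import Foundation

// Sketch: reduce the array of detected boxes to the single most
// prominent one by taking the highest confidence score.
struct Prediction {
    let classIndex: Int   // which object class was detected
    let score: Float      // confidence of the detection
    let rect: CGRect      // bounding box in the model's coordinates
}

func mostProminent(in boundingBoxes: [Prediction]) -> Prediction? {
    return boundingBoxes.max(by: { $0.score < $1.score })
}
```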

Using Object Detection in Vision

We now need to change how we get the location for the model, but we can keep the rest of the AR code. Instead of getting a CGPoint from the touch location, we get a CGRect from the model of where the object is and run those points through ARKit's hit testing to find where we should place the model in the 3D AR scene.

To do any of the AR work we need to return to the main thread to interact with the UI, but before we do that we should free the semaphore, allowing other requests to be completed if necessary: self.semaphore.signal()

guard let scaledRect = yolo.scaleImageForCameraOutput(predictionRect: prediction.rect, viewRect: self.view.bounds) else {
    print("could not scale the Point vectors")
    return
}
guard let model = arViewModel.createSceneNodeForAsset(nodeName, assetPath: "art.scnassets/\(fileName).\(fileExtension)") else {
    print("we have no model")
    return
}
let scaledPoint = CGPoint(x: scaledRect.origin.x, y: scaledRect.origin.y)
if let hitPoint = arViewModel.getHitResults(location: scaledPoint, sceneView: sceneView, resultType: [.existingPlaneUsingExtent, .estimatedHorizontalPlane]) {
    let pointTranslation = hitPoint.worldTransform.translation
    model.position = SCNVector3(pointTranslation.x, pointTranslation.y, pointTranslation.z)
    sceneView.scene.rootNode.addChildNode(model)
}

Once we are back on the main thread and ready to place the model on the scene, we first need to scale the CGRect prediction back into the proportions of the sceneView. This scaling code was written by Matthijs Hollemans (see his GitHub code) and assumes the scene the image comes from fills the full screen of a phone.
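The core idea of that rescaling step can be sketched as follows. This is not Hollemans' actual implementation (his may handle aspect ratio and cropping differently); it just shows the mapping from the model's 416×416 input space to the view's coordinate space under the full-screen assumption, with a function name of our own:

```swift
import Foundation

// Sketch: map a rect expressed in the model's 416×416 input space
// into the view's coordinate space, assuming the camera image fills
// the whole screen.
func scaleToView(_ predictionRect: CGRect, inputSize: CGFloat = 416, viewRect: CGRect) -> CGRect {
    let sx = viewRect.width / inputSize    // horizontal scale factor
    let sy = viewRect.height / inputSize   // vertical scale factor
    return CGRect(x: predictionRect.origin.x * sx,
                  y: predictionRect.origin.y * sy,
                  width: predictionRect.width * sx,
                  height: predictionRect.height * sy)
}
```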

Once we have the scaled CGRect position of the prediction we create the model, get a hit point of where the model is supposed to go, and use that hit point to change the model's position and add it to the sceneView's root node.

And just like that you can place a box wireframe around a potted plant. That in itself doesn't seem very useful or impressive, but we could use the same project setup to calculate the space between two objects and place a model there only if there is enough room, or to automatically place a new color on a wall or floor while avoiding the objects that get in the way.

There are a lot of possibilities in combining both frameworks, not only making your AR apps smarter but also helping make your Machine Learning apps more visual.


To summarise, in this post we went through the basics of CoreML and expanded the usage of ARKit by having it place models around existing objects.

The code for this demo can be found on Novoda’s GitHub and you can also check our ARDemoApp repo, where you can import your own models into an AR Scene without having to write a line of code.

Have any comments or questions? Hit us up on Twitter @bertadevant @KaraviasD

  1. Core ML is optimized for on-device performance, which minimizes memory footprint and power consumption. Running strictly on the device ensures the privacy of user data and guarantees that your app remains functional and responsive when a network connection is unavailable. ↩︎

  2. This post was updated in January 2019 to reflect that Apple has announced object detection using Vision. The code in the demos has also been updated. Novoda's GitHub ↩︎

  3. Matthijs is an incredible person and developer and you should take a look at his full project, blog and twitter. ↩︎