Detecting Japanese text regions with VNDetectTextRectanglesRequest

This article focuses on a very specific workaround: scan a page with VisionKit, use VNDetectTextRectanglesRequest to find where Japanese words live on the image, then draw visible boxes over those regions so the result can feed a later OCR pipeline.

[Figure: VisionKit document scan flow used before detecting Japanese text regions]

The value of this article is its narrow goal: do not recognize Japanese text yet, just find where it is on the page.

At the time of this article, VNRecognizeTextRequest was not the right tool for Japanese text in this workflow. So the article takes one step back and solves the lower-level problem first: capture a document image, detect the text regions, and visualize those regions with bounding boxes.

Important distinction: this is region detection, not full OCR. The output is a set of rectangles from VNTextObservation, which can later be cropped and passed into a custom recognition model.

VNDetectTextRectanglesRequest is useful when you need text location first and recognition second.

The article gives two reasons to use this request. First, it avoids depending on a recognizer that did not cover the target language in that setup. Second, it gives you a clean bridge into a custom OCR pipeline, because each detected region can be cut out and processed separately.

That makes the request a practical staging tool. Instead of trying to solve scanning, segmentation, and recognition in one jump, it isolates the segmentation layer and lets the rest of the pipeline evolve independently.

Use VNDocumentCameraViewController to capture a clean page image before any Vision request runs.

The first part of this article uses the built-in document scanner from VisionKit. Present the scanner, clear any old overlay layers from the previous run, then read the first scanned page inside the delegate callback.

import UIKit
import VisionKit

@IBAction func presentScanner() {
    // Present the system document scanner and receive its result as the delegate.
    let scanner = VNDocumentCameraViewController()
    scanner.delegate = self
    present(scanner, animated: true)

    // Remove the bounding-box layers left over from the previous run.
    imageView.layer.sublayers?.forEach { $0.removeFromSuperlayer() }
}

extension ViewController: VNDocumentCameraViewControllerDelegate {
    // Called when the user saves the scan.
    func documentCameraViewController(
        _ controller: VNDocumentCameraViewController,
        didFinishWith scan: VNDocumentCameraScan
    ) {
        // Only the first scanned page is analyzed in this walkthrough.
        if let cgImage = scan.imageOfPage(at: 0).cgImage {
            processImage(input: cgImage)
        }
        controller.dismiss(animated: true)
    }
}

[Figure: document scanning flow before Vision processes the image]
The scanner stage produces the image that the Vision request will analyze next.
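
The delegate above handles only the success path and only the first page. As a minimal sketch, assuming a hypothetical `ScanCoordinator` helper class that is not part of the original sample, the same flow could also cover cancellation, failure, unsupported devices, and multi-page scans:

```swift
import UIKit
import VisionKit

/// Hypothetical helper, not from the original sample project.
/// Collects every scanned page as a CGImage and reports them through `onPages`.
final class ScanCoordinator: NSObject, VNDocumentCameraViewControllerDelegate {
    private let onPages: ([CGImage]) -> Void

    init(onPages: @escaping ([CGImage]) -> Void) {
        self.onPages = onPages
    }

    /// Returns a configured scanner, or nil when the device cannot scan
    /// (for example, in the simulator).
    func makeScanner() -> VNDocumentCameraViewController? {
        guard VNDocumentCameraViewController.isSupported else { return nil }
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        return scanner
    }

    func documentCameraViewController(
        _ controller: VNDocumentCameraViewController,
        didFinishWith scan: VNDocumentCameraScan
    ) {
        // Walk all pages instead of assuming only page 0 matters.
        let pages = (0..<scan.pageCount).compactMap { scan.imageOfPage(at: $0).cgImage }
        controller.dismiss(animated: true) { self.onPages(pages) }
    }

    func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController) {
        controller.dismiss(animated: true)
    }

    func documentCameraViewController(
        _ controller: VNDocumentCameraViewController,
        didFailWithError error: Error
    ) {
        print("Document scan failed: \(error.localizedDescription)")
        controller.dismiss(animated: true)
    }
}
```

Because the scanner holds its delegate weakly, the view controller would need to keep a strong reference to the coordinator (for example, in a stored property) for as long as the scanner is on screen.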

Run VNDetectTextRectanglesRequest and treat the result as a list of VNTextObservation boxes.

Once the image exists, the article switches to Vision. The request does not return recognized strings here. It returns geometry. Each VNTextObservation represents one detected text region in normalized coordinates: values from 0 to 1 relative to the image, with the origin at the bottom-left corner.

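import UIKit
import Vision

func processImage(input: CGImage) {
    let request = VNDetectTextRectanglesRequest { [weak self] request, error in
        if let error = error {
            print("Text detection failed: \(error.localizedDescription)")
            return
        }
        // Each VNTextObservation is one detected text region in normalized coordinates.
        guard let results = request.results as? [VNTextObservation] else { return }

        // UIKit and Core Animation work must happen on the main thread.
        DispatchQueue.main.async {
            guard let self = self else { return }
            // Show the scanned page once, then draw one box per detected region.
            self.imageView.image = UIImage(cgImage: input)
            results.forEach { self.drawBoundingBox(for: $0) }
        }
    }

    let handler = VNImageRequestHandler(cgImage: input, options: [:])
    // perform(_:) is synchronous, so keep it off the main thread.
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try handler.perform([request])
        } catch {
            print("Vision request could not run: \(error.localizedDescription)")
        }
    }
}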

What comes back: in the sample result, the page contained five Japanese word regions and the request returned five matching observations. That is why this article is about layout extraction first, not text decoding yet.

Convert Vision's normalized coordinates into view coordinates and draw your own overlay layers.

The overlay step is the part that makes the result understandable. Vision gives each box in normalized image-space coordinates with the origin at the bottom-left, so the article scales those values to the UIImageView's size and flips the y-axis to match UIKit's top-left origin before drawing a green border layer.

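func drawBoundingBox(for result: VNTextObservation) {
    // Sublayer frames are expressed in the view's bounds, not its frame.
    // This assumes the image fills the whole view (for example, .scaleToFill).
    let viewSize = imageView.bounds.size
    let box = result.boundingBox

    // Scale the normalized box to the view and flip y: Vision measures from
    // the bottom-left corner, UIKit from the top-left.
    let x = box.minX * viewSize.width
    let y = (1 - box.maxY) * viewSize.height
    let width = box.width * viewSize.width
    let height = box.height * viewSize.height

    let outline = CALayer()
    outline.frame = CGRect(x: x, y: y, width: width, height: height)
    outline.borderColor = UIColor.green.cgColor
    outline.borderWidth = 3
    imageView.layer.addSublayer(outline)
}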

[Figure: detected Japanese text regions shown with green bounding boxes]
The finished overlay makes each detected Japanese word region visible on top of the scanned page.
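
The conversion above assumes the image fills the image view. When the view uses `.scaleAspectFit`, the image is letterboxed and the boxes drift away from the text. The following is a sketch of one way to compensate, using two hypothetical helpers (`aspectFitRect` and `viewRect`, neither of which appears in the original sample) that first find the rectangle the image actually occupies and then map the normalized box into it:

```swift
import Foundation

/// Hypothetical helper: the rectangle an image of `imageSize` occupies
/// inside `bounds` when displayed with aspect-fit scaling.
func aspectFitRect(imageSize: CGSize, in bounds: CGRect) -> CGRect {
    guard imageSize.width > 0, imageSize.height > 0 else { return .zero }
    let scale = min(bounds.width / imageSize.width, bounds.height / imageSize.height)
    let size = CGSize(width: imageSize.width * scale, height: imageSize.height * scale)
    return CGRect(
        x: bounds.midX - size.width / 2,
        y: bounds.midY - size.height / 2,
        width: size.width,
        height: size.height
    )
}

/// Hypothetical helper: maps a Vision-style normalized box (origin at the
/// bottom-left) into top-left-origin view coordinates inside `imageRect`.
func viewRect(forNormalized box: CGRect, in imageRect: CGRect) -> CGRect {
    CGRect(
        x: imageRect.minX + box.minX * imageRect.width,
        y: imageRect.minY + (1 - box.maxY) * imageRect.height,
        width: box.width * imageRect.width,
        height: box.height * imageRect.height
    )
}
```

In the drawing function, the outline layer's frame would then come from `viewRect(forNormalized: result.boundingBox, in: aspectFitRect(imageSize: image.size, in: imageView.bounds))`. AVFoundation's `AVMakeRect(aspectRatio:insideRect:)` computes the same aspect-fit rectangle if you prefer a system call.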

This article also points out that you can move from word regions to character-level regions by working with characterBoxes instead. Vision only fills in characterBoxes when the request's reportCharacterBoxes property is set to true. The drawing logic stays almost the same. Only the level of granularity changes.
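
A minimal sketch of the character-level variant, assuming a hypothetical `detectCharacterBoxes(in:)` function that is not part of the original sample. It runs the request synchronously and returns the normalized box of every detected character:

```swift
import Vision

/// Hypothetical helper: runs text-rectangle detection and returns the
/// normalized bounding box of every detected character.
func detectCharacterBoxes(in image: CGImage) throws -> [CGRect] {
    let request = VNDetectTextRectanglesRequest()
    // Without this flag, `characterBoxes` stays nil on each observation.
    request.reportCharacterBoxes = true

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    // perform(_:) is synchronous, so call this function off the main thread.
    try handler.perform([request])

    let observations = (request.results as? [VNTextObservation]) ?? []
    return observations.flatMap { observation in
        (observation.characterBoxes ?? []).map { $0.boundingBox }
    }
}
```

Each returned rectangle uses the same normalized, bottom-left-origin convention as the word-level boxes, so it can go through the same conversion before drawing.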

The durable takeaway is to use this request as the front half of a custom Japanese OCR pipeline, while accepting that the boxes will not always be perfect.

This article is direct about the limitation: some boxes may be missed, and lighting or image quality can make the geometry less accurate. That is normal for a machine learning step like this. The goal is not perfection on the first pass. The goal is a usable segmentation stage.

From there, the next logical move is to crop the detected regions or characterBoxes and feed them into a custom recognition model trained for Japanese text. That is the handoff point where this article stops and a true OCR system begins.
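
The cropping step can be sketched as a small geometry conversion. The helper below is hypothetical (its name, the `padding` parameter, and its default value are assumptions, not from the original sample). It turns a normalized Vision box into a pixel rectangle with a top-left origin, pads it slightly, and clamps it to the image:

```swift
import Foundation

/// Hypothetical helper: converts a normalized Vision box (origin bottom-left)
/// into a pixel rectangle with a top-left origin, suitable for cropping.
/// `padding` (in pixels) grows the box so edge strokes are not clipped.
func pixelCropRect(
    forNormalized box: CGRect,
    imageWidth: Int,
    imageHeight: Int,
    padding: CGFloat = 0
) -> CGRect {
    let w = CGFloat(imageWidth)
    let h = CGFloat(imageHeight)

    // Scale to pixels, flip the y-axis, pad, snap outward to whole pixels,
    // and clamp each edge to the image bounds.
    let minX = max(0, (box.minX * w - padding).rounded(.down))
    let minY = max(0, ((1 - box.maxY) * h - padding).rounded(.down))
    let maxX = min(w, (box.maxX * w + padding).rounded(.up))
    let maxY = min(h, ((1 - box.minY) * h + padding).rounded(.up))

    return CGRect(x: minX, y: minY, width: max(0, maxX - minX), height: max(0, maxY - minY))
}
```

The resulting rectangle can be passed to `CGImage.cropping(to:)` to produce the image for one region, which a recognition model could then consume. Vision's own `VNImageRectForNormalizedRect` performs the scaling to pixel dimensions, but the vertical flip is still needed for top-left-origin cropping.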

Code: this article links to a sample project: mszopensource/VisionTextDetection.