Detecting Japanese text regions with VNDetectTextRectanglesRequest

Overview

This article has a deliberately narrow goal: do not recognize Japanese text yet, just find where it is on the page.

At the time of this article, VNRecognizeTextRequest was not the right tool for Japanese text in this workflow. So the article takes one step back and solves the lower-level problem first: capture a document image, detect the text regions, and visualize those regions with bounding boxes.

Important distinction: This is region detection, not full OCR. The output is a set of rectangles from VNTextObservation, which can later be cropped and passed into a custom recognition model.

Why

`VNDetectTextRectanglesRequest` is useful when you need text location first and recognition second.

The article gives two reasons to use this request. First, it avoids depending on a recognizer that did not cover the target language in that setup. Second, it gives you a clean bridge into a custom OCR pipeline, because each detected region can be cut out and processed separately.

That makes the request a practical staging tool. Instead of trying to solve scanning, segmentation, and recognition in one jump, it isolates the segmentation layer and lets the rest of the pipeline evolve independently.

Step 1

Use `VNDocumentCameraViewController` to capture a clean page image before any Vision request runs.

The first part of this article uses the built-in document scanner from VisionKit. Present the scanner, clear any old overlay layers from the previous run, then read the first scanned page inside the delegate callback.

import VisionKit

@IBAction func presentScanner() {
    let scanner = VNDocumentCameraViewController()
    scanner.delegate = self
    present(scanner, animated: true)

    imageView.layer.sublayers?.forEach { $0.removeFromSuperlayer() }
}

extension ViewController: VNDocumentCameraViewControllerDelegate {
    func documentCameraViewController(
        _ controller: VNDocumentCameraViewController,
        didFinishWith scan: VNDocumentCameraScan
    ) {
        if let cgImage = scan.imageOfPage(at: 0).cgImage {
            processImage(input: cgImage)
        }
        controller.dismiss(animated: true)
    }
}

Document scanning flow before Vision processes the image — The scanner stage produces the image that the Vision request will analyze next.

Step 2

Run `VNDetectTextRectanglesRequest` and treat the result as a list of `VNTextObservation` boxes.

Once the image exists, the article switches to Vision. The request does not return recognized strings here. It returns geometry. Each VNTextObservation represents one detected text region in normalized coordinates.

import Vision

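func processImage(input: CGImage) {
    let request = VNDetectTextRectanglesRequest { [weak self] request, error in
        if let error = error {
            print("Text detection failed: \(error)")
            return
        }
        guard let results = request.results as? [VNTextObservation] else { return }

        // UI updates must happen on the main queue. Set the image once,
        // then draw one box per detected region.
        DispatchQueue.main.async {
            guard let self = self else { return }
            self.imageView.image = UIImage(cgImage: input)
            results.forEach { self.drawBoundingBox(for: $0) }
        }
    }

    // Run the request off the main thread so the UI stays responsive.
    let handler = VNImageRequestHandler(cgImage: input, options: [:])
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try handler.perform([request])
        } catch {
            print("Could not perform the Vision request: \(error)")
        }
    }
}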

What comes back: In the sample result, the page contained five Japanese word regions and the request returned five matching observations. Each observation carries geometry only, which is why this article treats the step as layout extraction rather than text decoding.

Step 3

Convert Vision's normalized coordinates into view coordinates and draw your own overlay layers.

The overlay step is the part that makes the result understandable. Vision gives the box in normalized image-space coordinates, so the article converts those values into the UIImageView's frame before drawing a green border layer.

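func drawBoundingBox(for result: VNTextObservation) {
    // Vision's normalized coordinates have their origin at the bottom-left,
    // so flip the y-axis using the box's top edge (maxY).
    // Assumes the image fills the image view exactly (no letterboxing).
    let box = result.boundingBox
    let viewSize = imageView.bounds.size
    let x = box.minX * viewSize.width
    let y = (1 - box.maxY) * viewSize.height
    let width = box.width * viewSize.width
    let height = box.height * viewSize.height

    // Draw a green outline as a sublayer of the image view.
    let outline = CALayer()
    outline.frame = CGRect(x: x, y: y, width: width, height: height)
    outline.borderColor = UIColor.green.cgColor
    outline.borderWidth = 3
    imageView.layer.addSublayer(outline)
}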

Detected Japanese text regions shown with green bounding boxes — The finished overlay makes each detected Japanese word region visible on top of the scanned page.
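
The conversion in this step scales the box by the full size of the image view, which is only accurate when the image is stretched across the whole view. With an aspect-fit content mode the page is letterboxed, so the boxes need to be mapped into the rectangle the image actually occupies. The following is a minimal sketch of that mapping as pure geometry, under the assumption that the view uses aspect-fit scaling. The helper names `aspectFitRect` and `viewRect(forNormalized:in:)` are hypothetical and do not come from the article's sample project.

```swift
import Foundation

/// Rectangle that an image of `imageSize` occupies inside a view of
/// `viewSize` when it is scaled with aspect-fit and centered.
func aspectFitRect(imageSize: CGSize, in viewSize: CGSize) -> CGRect {
    guard imageSize.width > 0, imageSize.height > 0 else { return .zero }
    let scale = min(viewSize.width / imageSize.width,
                    viewSize.height / imageSize.height)
    let size = CGSize(width: imageSize.width * scale,
                      height: imageSize.height * scale)
    return CGRect(x: (viewSize.width - size.width) / 2,
                  y: (viewSize.height - size.height) / 2,
                  width: size.width,
                  height: size.height)
}

/// Maps a Vision normalized rect (origin bottom-left, values 0...1) into
/// top-left-origin view coordinates inside `displayRect`.
func viewRect(forNormalized box: CGRect, in displayRect: CGRect) -> CGRect {
    return CGRect(x: displayRect.minX + box.minX * displayRect.width,
                  // Flip the y-axis: the top edge of the box is maxY in Vision.
                  y: displayRect.minY + (1 - box.maxY) * displayRect.height,
                  width: box.width * displayRect.width,
                  height: box.height * displayRect.height)
}
```

Inside `drawBoundingBox(for:)`, the outline frame could then be computed by passing `result.boundingBox` and the aspect-fit rectangle for the current image and `imageView.bounds.size` to `viewRect(forNormalized:in:)`.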

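This article also points out that you can move from word regions to character-level regions by working with characterBoxes instead. That property is only populated when the request's reportCharacterBoxes flag is set to true before the request runs. The drawing logic stays almost the same. Only the level of granularity changes.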
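
A hedged sketch of the character-level variant follows. It assumes a synchronous call made off the main thread, and it copies the geometry into a plain value type so that later pipeline stages do not depend on Vision. `TextRegion`, `characterRects(in:)`, and `detectRegions(in:)` are hypothetical names introduced for illustration. The Vision part is wrapped in `#if canImport(Vision)` so the geometry helpers still compile elsewhere.

```swift
import Foundation

/// Hypothetical plain-value container for one detected region.
struct TextRegion: Equatable {
    var word: CGRect          // normalized, origin bottom-left
    var characters: [CGRect]  // normalized; empty if no character boxes were reported
}

/// Flattens regions into one list of character rects, keeping detection order.
func characterRects(in regions: [TextRegion]) -> [CGRect] {
    return regions.flatMap { $0.characters }
}

#if canImport(Vision)
import Vision

/// Runs text-rectangle detection synchronously. Call this off the main thread.
func detectRegions(in image: CGImage) throws -> [TextRegion] {
    let request = VNDetectTextRectanglesRequest()
    request.reportCharacterBoxes = true  // without this, characterBoxes stays nil

    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])

    // The cast mirrors the article's code; newer SDKs already type `results`.
    let observations = (request.results as? [VNTextObservation]) ?? []
    return observations.map { observation in
        TextRegion(word: observation.boundingBox,
                   characters: (observation.characterBoxes ?? []).map { $0.boundingBox })
    }
}
#endif
```

Each rect in `characters` can be drawn with the same overlay logic used for word regions, since it uses the same normalized coordinate space.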

Next Step

The durable takeaway is to use this request as the front half of a custom Japanese OCR pipeline, while accepting that the boxes will not always be perfect.

This article is direct about the limitation: some boxes may be missed, and lighting or image quality can make the geometry less accurate. That is normal for a machine learning step like this. The goal is not perfection on the first pass. The goal is a usable segmentation stage.

From there, the next logical move is to crop the detected regions or characterBoxes and feed them into a custom recognition model trained for Japanese text. That is the handoff point where this article stops and a true OCR system begins.
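
The cropping handoff could look roughly like the sketch below. It assumes the crop is taken from the same `CGImage` that was passed to Vision, and that the custom model accepts one `CGImage` per region. `pixelCropRect(forNormalized:imageWidth:imageHeight:)` and `cropRegions(from:normalizedBoxes:)` are hypothetical helper names. The y-axis is flipped because Vision's normalized boxes use a bottom-left origin, while `CGImage.cropping(to:)` works in pixel coordinates measured from the top-left.

```swift
import Foundation

/// Converts a Vision normalized rect (origin bottom-left) into a pixel rect
/// with a top-left origin, clamped to the image bounds.
func pixelCropRect(forNormalized box: CGRect, imageWidth: Int, imageHeight: Int) -> CGRect {
    let w = CGFloat(imageWidth)
    let h = CGFloat(imageHeight)
    let rect = CGRect(x: box.minX * w,
                      y: (1 - box.maxY) * h,  // flip the y-axis
                      width: box.width * w,
                      height: box.height * h)
    // Snap to whole pixels, then keep only the part inside the image.
    return rect.integral.intersection(CGRect(x: 0, y: 0, width: w, height: h))
}

#if canImport(CoreGraphics)
import CoreGraphics

/// Cuts one sub-image per normalized box; boxes that fall outside the image are skipped.
func cropRegions(from image: CGImage, normalizedBoxes: [CGRect]) -> [CGImage] {
    return normalizedBoxes.compactMap { box -> CGImage? in
        let rect = pixelCropRect(forNormalized: box,
                                 imageWidth: image.width,
                                 imageHeight: image.height)
        guard !rect.isEmpty else { return nil }
        return image.cropping(to: rect)
    }
}
#endif
```

The same function works for word regions and for character boxes, because both are reported in the same normalized space. Whether the recognizer benefits from some padding around each crop depends on how the custom model was trained, so that choice is left open here.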

Code: This article links to a sample project here: mszopensource/VisionTextDetection.
