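In this work we also reached a methodological finding: structural tokens should be evaluated with F1 based on multiset token matching, not with region CER, which concatenates the strings inside each region and takes an edit distance. Region CER behaves misleadingly for structural evaluation; for example, in a line densely packed with kanbun kaeriten, a single misalignment can heavily distort the whole metric. F1 is order-independent and robust to outliers, and returns values closer to actual behavior.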
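Precision / Recall / F1 per annotation structure are as follows (kana-fold unified, after <rt2> removal, i.e. equivalent to the deployed model).

| Structure | Item definition | gold | P | R | F1 |
|---|---|---|---|---|---|
| Furigana | Exact match of the (kanji, reading) pair | 29,733 | 0.785 | 0.733 | 0.758 |
| Okurigana | Exact match of the string inside <OKURI> | 4,478 | 0.646 | 0.725 | 0.683 |
| Kaeriten | Exact match of the string inside <KAERI> | 2,222 | 0.803 | 0.710 | 0.754 |
| Warigaki | Exact match of the span (right column + separator + left column) | 2,259 | 0.411 | 0.423 | 0.417 |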
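Furigana keeps F1 0.76 even under the demanding exact match on “which kanji gets which reading,” and okurigana is solid at F1 0.68. Warigaki F1 is lower because a span is a long two-column text and the definition is strict (a single differing character makes the span incorrect); detecting that a line contains warigaki at all reaches F1 of about 0.69, so presence is mostly picked up correctly. The kaeriten F1 of 0.754 is measured on the same standard annotation as the previous version (v12: 0.737) and stays at roughly the same level. However, Minna de Honkoku texts also include many lines densely packed with kanbun kaeriten, and adding those lines to the evaluation lowers kaeriten F1 to about 0.53. The main cause is over-capture, where adjacent kana or several characters are absorbed into a single kaeriten; this is the main improvement target of this version.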
Minna de Honkoku OCR is an OCR system that, given a cursive Japanese (kuzushiji) classical book image, generates not only the body text transcription but also the surrounding annotation structure—furigana (ruby readings), okurigana and kaeriten (Japanese-style reading marks for Chinese-text passages), and warigaki (interlinear two-column notes). The output is structured text with tags preserving these annotation relations, making downstream typesetting, search, and reuse straightforward.
Dataset Provenance
Training data is sourced from Minna de Honkoku, a citizen-driven transcription platform launched in 2017 and jointly operated by the National Museum of Japanese History, the Earthquake Research Institute of the University of Tokyo, and the Kyoto University Historical Earthquakes Research Group. The platform hosts classical-book images from the IIIF digital archives of multiple holding institutions, including the National Diet Library, National Institute of Japanese Literature, University of Tokyo Library, Kyoto University Library, National Museum of Japanese History, Fukui Prefectural Digital Archives, and University of the Ryukyus Library. Volunteer transcribers attach transcription and annotations (furigana, kaeriten, etc.) to those images.
From the licensed portion of these transcriptions, we paired per-line position information (coordinates on the IIIF Image API) with transcription text and extracted approximately 1.2 million line-image / transcription pairs, organizing them into our custom webdataset_v3 format. Train/validation/test splits are deterministic by a hash of the title (entryId), so different lines from the same book never cross splits—the system is evaluated for book-level generalization.
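A minimal sketch of a deterministic, title-level split follows. The hash function (SHA-1), the bucket count, and the 5% / 5% validation and test shares are assumptions for illustration; the text above only states that the split is a deterministic function of a hash of entryId.

```python
import hashlib


def split_for(entry_id: str, val_pct: int = 5, test_pct: int = 5) -> str:
    """Map a title (entryId) to a split; every line of that title follows it.

    SHA-1 and the percentage shares are illustrative assumptions.
    """
    digest = hashlib.sha1(entry_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Two lines from the same (hypothetical) title always land in the same split.
lines = [("entry-0001", "line-a"), ("entry-0001", "line-b")]
splits = {split_for(entry_id) for entry_id, _ in lines}
```

Because the split depends only on the title identifier, re-running the pipeline or adding new lines to an existing title cannot move that title across splits.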
This dataset is a scale-up of the Minna de Honkoku Dataset published by NDL Lab, which is also derived from Minna de Honkoku transcriptions. Following the NDL Lab dataset’s design as a foundation, we broaden the source materials and strengthen automatic line-image extraction and annotation-structure preservation (furigana, kaeriten, etc.), expanding the scale to ~1.2 million line pairs.
For automatic line-image extraction we use a page-layout / line-detection model trained on the NDL-DocL Dataset (document-image layout dataset), published by NDL Lab. Line bounding boxes detected on each page are cropped via the IIIF Image API. For aligning each cropped line image with a single line of transcription text (i.e., which bbox corresponds to which line of text), we use the parseq ONNX model from NDL Kotenseki OCR-Lite as a recognition prior: we compute the edit distance (Levenshtein distance) between each tentative parseq output and each transcription line, normalize by length, and accept only pairs whose normalized distance falls below a fixed threshold. This filters out noisy or clearly misaligned correspondences and yields a mechanically validated text–image mapping.
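The acceptance rule described above can be sketched as follows. The normalization by the longer string and the threshold of 0.3 are assumptions; the text states only that the distance is length-normalized and compared against a fixed threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def match_lines(ocr_outputs, transcript_lines, threshold=0.3):
    """For each tentative OCR output, keep the closest transcript line
    only if its normalized distance is below the (hypothetical) threshold."""
    if not transcript_lines:
        return []
    pairs = []
    for i, hyp in enumerate(ocr_outputs):
        j, d = min(((j, normalized_distance(hyp, ref))
                    for j, ref in enumerate(transcript_lines)),
                   key=lambda t: t[1])
        if d < threshold:
            pairs.append((i, j, d))
    return pairs
```

Lines whose best candidate is still far from every transcript line are simply dropped, which is what removes misaligned crops from the training pairs.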
The constructed dataset will be released under an open license in the near future, in a form usable for both academic and commercial purposes. Details will be announced in this repository.
Data Preprocessing
Transcriptions use a notation specific to Minna de Honkoku. For example, “漢字(かんじ)” or “《振り仮名:漢字|かんじ》” encode furigana; “ ̄ニ” or “[ニ]” encode okurigana; “_レ” or “{レ}” encode kaeriten; and “《割書:右|左》” encodes warigaki. As a first step, all of these are normalized into special training tokens: <ruby>, <rt>, <OKURI>, <KAERI>, <WARI>, etc. To absorb kana-orthographic variation across transcribers, we fold isolated single-character katakana into hiragana, and also unify 464 classical-form (kyujitai) characters to their modern (shinjitai) counterparts via the kyujipy mapping. The “second-reading” ruby (a furigana on the left side of the kanji, encoded as <rt2>) appears very rarely in the gold, and the model previously over-produced it; in the current version it is removed at preprocessing time and excluded from the vocabulary.
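As an illustration of the kana fold, the sketch below converts a katakana character to hiragana only when it is not adjacent to another katakana character or a prolonged-sound mark. This definition of “isolated” is an assumption; the exact rule used in the pipeline is not stated here.

```python
import re

# One katakana character (ァ..ヶ) with no katakana or "ー" on either side.
# Treating this as "isolated" is an assumption made for the sketch.
_ISOLATED_KATAKANA = re.compile(r"(?<![ァ-ヺー])[ァ-ヶ](?![ァ-ヺー])")


def fold_isolated_katakana(text: str) -> str:
    """Fold isolated katakana to hiragana; katakana words stay untouched.

    Katakana ァ..ヶ sit exactly 0x60 code points above hiragana ぁ..ゖ.
    """
    return _ISOLATED_KATAKANA.sub(lambda m: chr(ord(m.group()) - 0x60), text)


folded = fold_isolated_katakana("山ニ登ル")
```

With this rule a transcriber's “山ニ登ル” and another's “山に登る” map to the same training target, while multi-character katakana runs are preserved.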
Model Architecture
The line recognizer is a Vision-Encoder-Decoder neural network that generates structured text from images, implemented on top of HuggingFace’s VisionEncoderDecoderModel framework.
The encoder uses ConvNeXt V2 at the Base scale (~88M parameters). ConvNeXt V2 adds self-supervised pretraining via a fully convolutional masked autoencoder (FCMAE) and a Global Response Normalization (GRN) layer; we start from weights fine-tuned on ImageNet-22k at 384px. Each vertical line is rotated 90° to a horizontal orientation and resized, preserving its aspect ratio, to height 256px and width up to 2048px (i.e., an aspect ratio of at most 8:1). With a cumulative stride of 32, a full-width input yields an 8×64 feature map (512 vectors of 1024 dimensions). On top of this we add a custom learned 2D positional embedding indicating “which row and column of the line is this feature from” (8 row embeddings and 72 column embeddings; we add the appropriate row and column embeddings to each cell and apply LayerNorm).
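The shape arithmetic above can be checked with a short sketch. The function names are hypothetical, and the handling of lines longer than 8:1 (capped here) and of widths that are not a multiple of the stride (rounded up here) are assumptions.

```python
import math

TARGET_H = 256   # input height in pixels after resizing
MAX_W = 2048     # maximum input width in pixels
STRIDE = 32      # cumulative stride of the encoder


def resized_size(height: int, width: int) -> tuple[int, int]:
    """Size of a line image (already rotated to horizontal) after an
    aspect-preserving resize to height 256, with the width capped at 2048."""
    scale = TARGET_H / height
    new_w = min(MAX_W, max(STRIDE, round(width * scale)))
    return TARGET_H, new_w


def feature_grid(new_h: int, new_w: int) -> tuple[int, int]:
    """Rows and columns of the encoder feature map for a given input size."""
    return new_h // STRIDE, math.ceil(new_w / STRIDE)


rows, cols = feature_grid(*resized_size(128, 2000))  # a long line hits the cap
```

A full-width input gives 8 rows and 64 columns, i.e. 512 feature vectors; shorter lines give fewer columns.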
The decoder is a small RoBERTa-based Transformer (6 layers, hidden size 512, 8 heads). It is initialized from a RoBERTa pretrained on a ~57 million-character corpus assembled from Minna de Honkoku transcriptions, using a masked language modeling (MLM) objective. In other words, the decoder starts from a language model already adapted to the target domain. We then transfer it to the OCR task; the cross-attention is trained from scratch. Generation is autoregressive, one token at a time, from <CLS> until <SEP>. The vocabulary is a character-level tokenizer consisting of the top 5,000 most frequent characters in the corpus, and it places 11 structural special tokens in the same vocabulary as those characters, so that character recognition and annotation tagging happen jointly in a single generation pass.
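A minimal sketch of such a joint character-plus-tag vocabulary is shown below. The tag regex, the id layout (special tokens first), and the example token list are assumptions; the full list of 11 structural tokens is not reproduced here.

```python
import re
from collections import Counter

# A tag such as <KAERI> is one token; anything else is one character.
_TOKEN = re.compile(r"<[^<>]+>|.", re.DOTALL)


def tokenize(text: str) -> list[str]:
    """Split text into structural tags and single characters."""
    return _TOKEN.findall(text)


def build_vocab(corpus_lines, special_tokens, top_k=5000):
    """Special tokens first, then the top_k most frequent characters."""
    specials = list(special_tokens)
    counts = Counter(tok for line in corpus_lines
                     for tok in tokenize(line) if tok not in specials)
    chars = [ch for ch, _ in counts.most_common(top_k)]
    return {tok: i for i, tok in enumerate(specials + chars)}


# Hypothetical miniature corpus and token list, for illustration only.
vocab = build_vocab(["あああいい", "<KAERI>レ"],
                    ["<CLS>", "<SEP>", "<KAERI>"], top_k=2)
```

Because tags and characters share one id space, a single decoder softmax chooses between “emit a character” and “open or close an annotation” at every step.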
Figure: Model architecture of Minna de Honkoku OCR
Training
Optimization uses AdamW with two learning-rate groups: the newly initialized modules (2D positional embeddings, the encoder/decoder projection layer, and cross-attention) at a high rate of 1×10⁻³, and the encoder backbone and the pretrained decoder body at a low rate of 5×10⁻⁵. Training runs in bfloat16 autocast with effective batch size 64 for roughly 5 epochs on a single NVIDIA A100. Raising the input resolution from 192×1536 to 256×2048 (~1.78× the pixels) increases memory consumption, so we shrink the micro-batch and increase gradient accumulation to keep the effective batch size unchanged. The learning rate is annealed linearly in the final segment, and as in the previous version the validation CER drops further during that anneal.
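The two-group setup and the accumulation arithmetic can be sketched without a deep-learning dependency. The parameter-name markers and the micro-batch size of 8 are hypothetical; the resulting list of dicts follows the per-parameter-group form that torch.optim.AdamW accepts, with names standing in for the actual tensors.

```python
# Substrings that mark newly initialized modules (hypothetical names).
NEW_MODULE_MARKERS = ("pos_embed_2d", "enc_to_dec_proj", "crossattention")


def build_param_groups(param_names, new_lr=1e-3, pretrained_lr=5e-5):
    """Partition parameters into a high-LR group (fresh modules) and a
    low-LR group (encoder backbone and pretrained decoder body)."""
    fresh = [n for n in param_names
             if any(marker in n for marker in NEW_MODULE_MARKERS)]
    pretrained = [n for n in param_names if n not in fresh]
    return [{"params": fresh, "lr": new_lr},
            {"params": pretrained, "lr": pretrained_lr}]


def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Gradient-accumulation steps needed to reach the effective batch."""
    if effective_batch % micro_batch:
        raise ValueError("effective batch must be a multiple of micro batch")
    return effective_batch // micro_batch


groups = build_param_groups([
    "encoder.stages.0.weight",
    "decoder.layer.0.crossattention.query.weight",
    "pos_embed_2d.row.weight",
])
steps = accumulation_steps(64, 8)      # 8 micro-batches per optimizer step
pixel_ratio = (256 * 2048) / (192 * 1536)
```

Halving the micro-batch doubles the accumulation steps, so the optimizer still sees gradients averaged over 64 lines per update.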
The loss is cross-entropy with label smoothing 0.1, augmented with weighted CE (weight 2.0 on structural tokens). We additionally add a focal-style regularizer that penalizes high <SEP> probability at non-terminal positions (coefficient 0.5) to discourage the model from cutting lines short. To compensate for the low frequency of kaeriten and okurigana, we oversample the corresponding training lines by 2.0× each. Augmentation via albumentations randomly applies elastic distortion, morphological operations, Gaussian noise, resolution reduction, and JPEG compression artifacts.
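For a single decoding position, one common formulation of label-smoothed, class-weighted cross-entropy looks like the sketch below. It is an assumed formulation rather than the exact training loss, and the <SEP> regularizer is not shown because its precise form is not specified above.

```python
import math


def weighted_smoothed_ce(logits, target, weight=1.0, smoothing=0.1):
    """Label-smoothed cross-entropy for one position, scaled by a weight.

    weight would be 2.0 when the target is a structural token and 1.0
    otherwise; applying the weight to the whole term is an assumption.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable logsumexp
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]                     # loss on the true class
    uniform = -sum(log_probs) / len(log_probs)   # loss against a uniform target
    return weight * ((1.0 - smoothing) * nll + smoothing * uniform)


# A structural-token target under a completely uninformative prediction.
loss = weighted_smoothed_ce([0.0, 0.0, 0.0, 0.0], target=1, weight=2.0)
```

With uniform logits over K classes both terms equal log K, so the weighted loss is simply 2·log K, which makes the effect of the structural-token weight easy to see.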
Evaluation
On a book-level test split of about 62,000 lines, with kana variation normalized, plain text CER is approximately 0.087 (~91.3% character-level accuracy). The previous version (v12) achieved CER 0.094 (~90.6%) under the same canonicalization, so replacing the encoder with ConvNeXt V2 and raising the input resolution from 192×1536 to 256×2048 cut the error rate by about 7% in relative terms. Annotation-tagging F1 reaches ~88% for “which kanji receives furigana,” ~76% for “exact (kanji, reading) pair match,” ~68% for okurigana, ~75% for kaeriten, and ~42% for warigaki.
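The headline figures follow from simple arithmetic on the two CER values, shown here as a small check (the function name is hypothetical):

```python
def relative_error_reduction(old_cer: float, new_cer: float) -> float:
    """Fraction of the previous error rate that was removed."""
    return (old_cer - new_cer) / old_cer


accuracy_v13 = 1 - 0.087   # character-level agreement of the current version
accuracy_v12 = 1 - 0.094   # character-level agreement of the previous version
reduction = relative_error_reduction(0.094, 0.087)
```

The reduction is relative to the previous error rate; in absolute terms the CER moved by 0.007.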
About the CER figures (important). The CER reported here is measured against the volunteer transcriptions of Minna de Honkoku as ground truth. However, those transcriptions have not yet undergone expert review: a separate sampling survey over ~100,000 characters estimated the transcriptions’ own accuracy at about 98.5%. The numbers above are therefore agreement rates against an imperfect reference—errors in the transcription can be counted as model errors and vice versa—and do not strictly equal the model’s true recognition accuracy. We plan a more rigorous CER measurement using an expert-reviewed test set in the future.
Methodologically, structural tokens are best evaluated with multiset-token F1 rather than region CER. F1 is order-invariant and outlier-robust, and gives a more realistic picture.
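A minimal sketch of multiset-token precision, recall, and F1 is given below. How items are extracted from the gold and predicted text (for example, the string inside each <KAERI> tag) is assumed to happen beforehand.

```python
from collections import Counter


def multiset_prf(gold_items, pred_items):
    """Precision, recall and F1 over multisets of extracted items.

    Order is ignored; an item predicted twice but present once in the
    gold counts as one true positive and one false positive.
    """
    gold, pred = Counter(gold_items), Counter(pred_items)
    tp = sum((gold & pred).values())   # multiset intersection
    precision = tp / sum(pred.values()) if pred else 0.0
    recall = tp / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Same marks in a different order, with one spurious repeat and one miss.
p, r, f1 = multiset_prf(["レ", "一", "二"], ["二", "レ", "レ"])
```

Because only counts matter, a shifted mark costs at most one false positive and one false negative, instead of cascading through an edit-distance alignment of the concatenated region string.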
Furigana keeps F1 0.76 even under the strict “which kanji gets which reading” exact match, and okurigana reaches F1 0.68. Warigaki F1 is relatively low because its unit is a long two-column string and any single-character mismatch fails the span, but a looser “does the line contain warigaki at all” detection reaches F1 ≈ 0.69. The kaeriten F1 of 0.754 is measured on the same standard annotation as the previous version (v12: 0.737), so it holds essentially steady. However, Minna de Honkoku also contains many lines of densely-marked Chinese (kanbun) text; including those lines drops the overall kaeriten F1 to about 0.53, mainly because the model over-captures adjacent kana or multiple characters into a single return mark—the main improvement target for this version.
Per-host plain micro CER (hosts with n ≥ 300, plus two especially-accurate small hosts; weighted average 0.087):
| Holding institution (IIIF host) | n | CER |
|---|---|---|
| Ryukoku University Library (da2.library.ryukoku.ac.jp) | 212 | 0.032 |
| Ritsumeikan University ARC (www.arc.ritsumei.ac.jp) | 231 | 0.057 |
| Kyoto University Library (rmda.kulib.kyoto-u.ac.jp) | 1,192 | 0.061 |
| Fukui Prefectural Digital Archives (www.digital-archives.pref.fukui.lg.jp) | 6,924 | 0.064 |
| University of Tokyo Library (iiif.dl.itc.u-tokyo.ac.jp) | 8,606 | 0.069 |
| National Diet Library (dl.ndl.go.jp) | 20,875 | 0.075 |
| ADEAC (dcfs.trc-adeac.co.jp) | 2,129 | 0.081 |
| University of the Ryukyus Library (shimuchi.lib.u-ryukyu.ac.jp) | 917 | 0.083 |
| National Museum of Japanese History (khirin-a.rekihaku.ac.jp) | 3,580 | 0.093 |
| Tokyo Gakugei University Library (d-archive.u-gakugei.ac.jp) | 1,108 | 0.107 |
| amane project (ourarchives.amane-project.jp) | 6,473 | 0.108 |
| National Diet Library (legacy) (www.dl.ndl.go.jp) | 3,590 | 0.113 |
| Kyushu University Library (catalog.lib.kyushu-u.ac.jp) | 1,132 | 0.117 |
| National Institute of Japanese Literature (kokusho.nijl.ac.jp) | | |
The corpus-wide best is Ryukoku University Library at 0.032, followed by Ritsumeikan University ARC at 0.057, though both span only a few hundred lines and therefore carry larger CER variance. Among higher-volume hosts, Kyoto University Library reaches 0.061 and the largest sample, the National Diet Library, reaches 0.075, whereas NIJL classical books sit around 0.15—clearly reflecting the range of script and page-layout difficulty. The worst hosts (cdm16028.contentdm.oclc.org, os3-…-sakura.ne.jp) are outliers in image quality and source material, and are targets for individual analysis going forward.
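To make the term “micro CER” concrete, the sketch below contrasts micro averaging (pool all edits and all reference characters) with macro averaging (average per-line CERs). The input tuples are invented for illustration and are not measurements from this evaluation.

```python
def micro_cer(lines):
    """Total edit operations divided by total reference characters.

    lines: iterable of (edit_distance, reference_length) per line.
    """
    lines = list(lines)
    total_edits = sum(edits for edits, _ in lines)
    total_chars = sum(length for _, length in lines)
    return total_edits / total_chars if total_chars else 0.0


def macro_cer(lines):
    """Unweighted mean of per-line CERs (shown for contrast)."""
    rates = [edits / length for edits, length in lines if length]
    return sum(rates) / len(rates) if rates else 0.0


# Invented example: one short noisy line and one long clean line.
example = [(5, 10), (5, 90)]
micro, macro = micro_cer(example), macro_cer(example)
```

Micro averaging weights each line by its length, so a few very short lines with many errors cannot dominate a host's figure the way they can under macro averaging.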
Inference
Processing in the browser is a two-stage pipeline: first layout recognition detects the body-text line regions in the input image, then line OCR (enc-dec recognition) runs on each detected line. For layout recognition we use the RTMDet-s model from NDL Kotenseki OCR-Lite (switchable in settings to this system’s own YOLOv8 model). The reading order of the detected line boxes is fixed during the layout stage by a text-free XY-Cut.
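A text-free XY-Cut can be sketched as a recursive split of the line boxes along projection gaps. The cut priority used here (vertical cuts first, columns read right to left, then top to bottom), which suits vertically written pages, is an assumption and not necessarily the ordering rule used in this system.

```python
def _split(boxes, axis):
    """Group boxes separated by gaps in their projection onto one axis.

    axis=0 projects onto x, axis=1 onto y; boxes are (x0, y0, x1, y1).
    """
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups, current, end = [], [ordered[0]], ordered[0][hi]
    for box in ordered[1:]:
        if box[lo] >= end:            # empty gap in the projection: cut here
            groups.append(current)
            current, end = [box], box[hi]
        else:                         # overlaps the running interval
            current.append(box)
            end = max(end, box[hi])
    groups.append(current)
    return groups


def xy_cut(boxes):
    """Return boxes in reading order using geometry only (no text)."""
    if len(boxes) <= 1:
        return list(boxes)
    columns = _split(boxes, 0)
    if len(columns) > 1:              # vertical cuts: read right to left
        return [b for group in reversed(columns) for b in xy_cut(group)]
    rows = _split(boxes, 1)
    if len(rows) > 1:                 # horizontal cuts: read top to bottom
        return [b for group in rows for b in xy_cut(group)]
    # No clean cut exists: fall back to a plain right-to-left, top-down sort.
    return sorted(boxes, key=lambda b: (-b[0], b[1]))


page = [(0, 0, 10, 100), (20, 0, 30, 100), (40, 0, 50, 50), (40, 60, 50, 100)]
order = xy_cut(page)
```

In the example the rightmost column contains two stacked boxes, which are read top to bottom before moving left to the next column.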
The trained model is exported to ONNX and compressed via quantize_dynamic (dynamic int8 quantization). The encoder is shipped as a single file (~89 MB), while for inference speed the decoder is split into two files: a prefill graph (~32 MB) that runs once on <CLS> and constructs the internal Key/Value cache, and a step graph (~28 MB) that consumes the cache and a single new token at each iteration. This means decoder self-attention shifts from “recompute over all tokens generated so far” to “extend by one new token,” giving a theoretical 5–10× speedup on greedy decoding. The graphs are executed in the browser’s WebAssembly environment using onnxruntime-web.
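The prefill/step division can be illustrated with a greedy loop over two callables. The callables below are toy stand-ins rather than the exported ONNX graphs, and their signatures are assumptions made for the sketch.

```python
def greedy_decode(prefill, step, cls_id, sep_id, max_len=128):
    """Greedy decoding with a key/value cache.

    prefill(cls_id) -> (logits, cache) runs once and builds the cache;
    step(token, cache) -> (logits, cache) extends it by one token.
    """
    logits, cache = prefill(cls_id)
    output = []
    for _ in range(max_len):
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if token == sep_id:
            break
        output.append(token)
        logits, cache = step(token, cache)  # only the new token is processed
    return output


# Toy stand-ins: the "model" always emits tokens 2, 3 and then <SEP> (id 1).
SCRIPT = [2, 3, 1]


def _one_hot(index, size=4):
    return [1.0 if i == index else 0.0 for i in range(size)]


def toy_prefill(cls_id):
    return _one_hot(SCRIPT[0]), [cls_id]


def toy_step(token, cache):
    cache = cache + [token]           # the cache grows by one entry per step
    return _one_hot(SCRIPT[len(cache) - 1]), cache


decoded = greedy_decode(toy_prefill, toy_step, cls_id=0, sep_id=1)
```

Each call to step handles a single token and reuses everything already stored in the cache, which is where the saving over re-running the decoder on the whole prefix comes from.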
Several auxiliary steps run around recognition at inference time. First, small per-line tilts are estimated with a projection-profile method and corrected before recognition. Second, a safety guard terminates generation if the output begins repeating the same token or a short cyclic pattern (end-of-line collapse).
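The repetition guard could be implemented along the following lines. The maximum cycle length of 4 and the requirement of 3 consecutive repeats are assumed values, not the thresholds used in the deployed system.

```python
def is_looping(tokens, max_period=4, min_repeats=3):
    """True if the tail of the sequence is a short pattern repeated
    min_repeats times (period 1 means the same token over and over)."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            continue
        tail = tokens[-span:]
        if all(tail[i] == tail[i % period] for i in range(span)):
            return True
    return False


def generate_with_guard(next_token, sep_id, max_len=128):
    """Call next_token(prefix) until <SEP>, max_len, or a detected loop."""
    output = []
    while len(output) < max_len:
        token = next_token(output)
        if token == sep_id:
            break
        output.append(token)
        if is_looping(output):
            break                     # stop before the collapse grows
    return output
```

Checking only the tail keeps the guard cheap enough to run after every generated token, and legitimate repetition shorter than the repeat threshold is left alone.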
Future Work
Low-resolution images, severely deformed scripts, and elements written as two columns within a single line (warigaki) remain failure modes. Replacing the encoder with ConvNeXt V2 and raising the input resolution in this version (v13) steadily improved body-text CER, but the absolute level for warigaki is still low; we plan further resolution increases and targeted additional training on specific source materials. For kaeriten, the next target is suppressing over-capture—the tendency to absorb adjacent kana or multiple characters into a single return mark. We are also exploring grammar-constrained decoding at inference time (to enforce tag consistency) and integration with a language model (to correct visually confusable characters). As noted above, the current CER is measured against unreviewed transcriptions as ground truth; we plan to re-evaluate on a rigorous test set that has undergone expert review.