技術情報

2026年6月22日執筆（v16 対応版）

「みんなで翻刻OCR」は、くずし字で書かれた古典籍の画像から、本文の翻刻に加え、ふりがな・送り仮名・返り点・割書（双行注記）といった注釈構造を含む電子テキストを一括して生成するOCRシステムです。出力は単純な文字列ではなく、原文に備わる注釈関係をタグで保持した構造化テキストとなっており、後段の組版表示・検索・二次利用に適した形式になっています。以下、データセット・前処理・モデル構成・学習・評価・推論時の工夫について順を追って述べます。

データセットの由来

学習データは、市民参加型の翻刻プラットフォーム「みんなで翻刻」（2017年公開、国立歴史民俗博物館・東京大学地震研究所・京都大学古地震研究会が共同運営）で公開されている翻刻成果に由来します。同プラットフォームには、国立国会図書館、国文学研究資料館、東京大学附属図書館、京都大学附属図書館、国立歴史民俗博物館、福井県デジタルアーカイブ、琉球大学附属図書館をはじめとする複数の所蔵機関の IIIF デジタルアーカイブから提供された古典籍画像が登録されており、ボランティアの翻刻者の方々がそれらに翻刻と注釈（ふりがな・返り点等）を付与しています。

本研究では、利用許諾が確認できる範囲の翻刻データから、行ごとの位置情報（IIIF Image API 上の座標）と翻刻テキストを組み合わせて、行画像と翻刻文の組をおよそ 120 万件抽出し、独自の webdataset_v3 形式に整備しました。学習・検証・テストへの分割は書名（entryId）のハッシュで決定的に行っており、同じ書物の異なる行が学習側とテスト側に混在しないため、書物単位での汎化性能を評価できます。

本研究で構築したデータセットは、NDLラボがみんなで翻刻のデータをもとに構築・公開している「みんなで翻刻データセット」を規模的に拡張したものです。NDLラボのデータセットが基盤となる方針を踏襲しつつ、対象資料を広げ、行画像の自動抽出と注釈構造（ふりがな・返り点等）の保持を強化することで、約 120 万行の規模に拡張しました。

行画像の自動抽出には、NDLラボが公開している「NDL-DocL データセット（資料画像レイアウトデータセット）」で訓練した版面・行レイアウト認識モデルを用いました。各画像から行 bbox を検出し、IIIF Image API 経由で行画像を切り出します。続いて切り出した行画像と翻刻テキスト 1 行とのアラインメント（どの bbox にどの翻刻行が対応するか）には、NDL古典籍OCR-Lite の parseq ONNX モデルを補助として用います。parseq で得た暫定認識結果と翻刻テキスト各行との編集距離（Levenshtein 距離）を計算し、文字数で正規化した距離が一定のしきい値を下回るペアのみを有効な対応関係として採用することで、ノイズの多い対応や明らかなずれを除外し、テキスト・画像の対応関係を機械的に確定させています。

本研究で構築したデータセットは、近日中にオープンライセンスで公開予定です。コミュニティが学術用途・商用用途いずれにも活用できる形での提供を予定しており、詳細は本リポジトリで追って告知します。

データ前処理

翻刻文には「みんなで翻刻」独自の記法が用いられています。例えば「漢字（かんじ）」や「《振り仮名：漢字｜かんじ》」でふりがな、「￣ニ」や「［ニ］」で送り仮名、「＿レ」や「｛レ｝」で返り点、「《割書：右｜左》」で割書を表現するものです。学習に先立ち、これらをすべて、<ruby> <rt> <OKURI> <KAERI> <WARI> といった学習用の特殊トークンへ正規化します。さらに、翻刻者によって表記が揺れがちな仮名（同じ助詞が資料ごとに「ニ」と「に」で書かれるなど）を抑えるため、孤立した1字のカタカナをひらがなへ畳み込む処理を加えています。一方で旧字体・異体字は新字体へ統一せず、原文の字形をそのまま忠実に保持する方針としました（前世代では 464 字の旧字体→新字体変換を行っていましたが、本版では廃止）。これにより、原資料に現れる旧字・異体字をそのまま出力でき、語彙もこれらを網羅できるよう拡張しています。なお、漢字の左側にもう一つ仮名が振られる二重ルビ（<rt2>）は、教師データ側でも実出現が非常に稀なうえ、モデルが過剰に発火しがちな構造であったため、現行版では前処理の段階で除外し、語彙にも残しません。

モデル構成

行認識モデルは、画像から構造化文字列を生成するVision-Encoder-Decoder 型のニューラルネットワークで、HuggingFace の VisionEncoderDecoderModel 枠組みを基盤に実装しています。

エンコーダには、画像認識で標準的に用いられる畳み込みネットワークであるConvNeXt V2 の Base 規模（パラメータ約 88M）を採用しました。ConvNeXt V2 は、マスク自己符号化（FCMAE）による自己教師あり事前学習と、チャネル間の応答を正規化する GRN（Global Response Normalization）を取り入れた改良版で、本システムでは ImageNet-22k で 384px に微調整済みの重みを起点としています。縦書きの 1 行画像を 90° 回転して横長化したうえで、高さ 256px・幅最大 2048px（縦横比 8:1 を維持）にアスペクト比保持でリサイズして入力します。内部の累積ストライドが 32 なので、最終段で縦 8×横 64 の特徴マップ（1024 次元のベクトルが計 512 個並んだもの）が得られます。これに対し、「行内のどの位置（縦・横のどのマス）の特徴か」を示す学習可能な 2 次元位置埋め込みを独自に追加しました（行方向 8 マス分、列方向 72 マス分の埋め込みを用意して該当位置を加算し、LayerNorm を適用）。縦書き行を横長化しているため、横軸が読み方向に対応します。位置情報をエンコーダ側で明示的に注入することで、長い行でもデコーダのクロスアテンションが行末まで安定して走査できるようにしています。

デコーダには RoBERTa をベースとした小型 Transformer（6 層、隠れ次元 512、ヘッド数 8）を用いました。これはみんなで翻刻の翻刻文約 5,700 万字を独自コーパスとして整備したうえで、MLM（Masked Language Model）目的で本研究にて事前学習した RoBERTa であり、すなわち本タスクのドメインに最初から最適化された言語モデルを起点としています。これを OCR タスクへ転用し、クロスアテンション部分は新規に学習させました（本版では OCR 本体を継承なしで完全にゼロから学習しています）。文頭トークン <CLS> から文末トークン <SEP> まで、1 トークンずつ次を予測する自己回帰生成です。語彙はコーパス中の頻出文字 7,710 文字を採用した単文字（character-level）トークナイザで、旧字体・異体字を忠実保存する方針に合わせて語彙を拡張し（前世代は 5,000 文字）、本文文字に加えてふりがな・返り点・送り仮名・割書を表す 11 種類の特殊トークンを同居させています。この設計により、文字認識と注釈付与をひとつの生成過程で同時に行えるのが本研究の特徴です。

学習

最適化器は AdamW を用い、新規初期化部（2D 位置埋め込み・エンコーダ／デコーダ間射影層・クロスアテンション）に高学習率 1×10⁻³、エンコーダ本体および事前学習済みデコーダ本体に低学習率 5×10⁻⁵ という 2 群構成としました。混合精度（bfloat16 autocast）で、有効バッチサイズ 64・おおむね 5 エポックを、単機の NVIDIA A100 で学習しています。入力解像度 256×2048 は画素数が多くメモリ消費が大きいため、ミニバッチを縮小して勾配累積を増やすことで実効バッチ数を維持しています。学習率スケジュールは最後の区間で線形に減衰させており、終盤のアニール区間で検証 CER がもう一段下がる挙動が確認できました。

損失関数は、ラベルスムージング 0.1 付きのクロスエントロピーに、構造トークンへの重み 2.0 を加えた重み付き CE 損失を用いています。さらに、行末の <SEP> がときに早期に発火して本文を切り詰める失敗に対し、非 SEP 位置で SEP 確率が高いほどペナルティを与える追加正則化項（フォーカル風、係数 0.5）を導入しました。出現頻度の低い返り点と送り仮名はそのままでは学習信号が弱いため、これらを含む行を学習ストリーム上でそれぞれ 2.0 倍にオーバーサンプリングしています。データ拡張としては albumentations による弾性歪み・形態学的変換（線の太さ揺らぎ）・ガウシアンノイズ・解像度低下・JPEG 圧縮劣化などをランダムに適用し、撮影条件や紙の状態の違いに頑健なモデルを目指しています。

評価

書物単位で分割したテスト集合（59,072 行）に対し、かな表記の揺れを正規化したうえで評価したところ、本文の文字単位正答率はおよそ 92.0%（plain micro CER 0.080）でした。語彙を 7,710 文字へ拡張し旧字・異体字を忠実保存しながらも、本文の認識精度はこの水準を維持しています。注釈構造の付与は、ふりがなの「(漢字, 読み) 完全一致」で F1 0.73、送り仮名で F1 0.71、返り点で F1 0.52（漢文の密な返り点を含む全件）、割書で F1 0.40 の水準で再現できています。

CER の数値について（重要）：ここで報告する CER は、「みんなで翻刻」のボランティア翻刻文を正解（ground truth）とみなして測定したものです。しかしこの翻刻文自体はまだ専門家による校閲を経ておらず、別途実施した 10 万字規模のサンプリング調査では、翻刻文そのものの精度はおよそ 98.5% と見積もられています。すなわち本数値は「誤りを含みうる正解」に対する一致率であり、モデルの真の認識精度とは厳密には一致しません（翻刻側の誤りがモデルの誤りとして計上されたり、その逆も起こり得ます）。今後、専門家による校閲を経た厳密なテストセットを用いて、より正確な CER の測定を実施する予定です。

なお本研究では、構造トークンの評価には、領域内の文字列を連結して編集距離を取る領域 CER ではなく、トークンの多重集合一致による F1 を用いるべきであるという方法論上の知見を得ました。領域 CER は、漢文の返り点が密に並ぶ行で 1 か所のずれが指標全体を大きく歪めるなど、構造評価では誤解を生む振る舞いを示します。F1 は順序非依存で外れ値にも頑健で、より実態に即した値を返します。

注釈構造ごとの Precision / Recall / F1（多重集合一致）は次のとおりです（kana-fold 統一、<rt2> 除去後＝デプロイ相当。v16 の test 59,072 行で実測）。

構造	項目の定義	gold	P	R	F1
ふりがな	(漢字, 読み) ペアの完全一致	29,279	0.764	0.697	0.729
送り仮名	`<OKURI>` 内文字列の完全一致	4,474	0.685	0.726	0.705
返り点	`<KAERI>` 内文字列の完全一致（全件）	11,455	0.535	0.499	0.516
割書	span（右列＋区切り＋左列）の完全一致	2,203	0.453	0.356	0.399

ふりがなは「どの漢字に何の読みを付けるか」という厳しい完全一致でも F1 0.73、送り仮名は F1 0.71 と良好です（送り仮名は前世代から改善）。割書の F1 が低めなのは、1 スパンが長い 2 列テキストで「1 文字でも違えば不正解」という厳しい span 完全一致の定義によるものです。返り点は、みんなで翻刻のテキストに漢文の返り点が密に並ぶ行が多く含まれるため、全件では F1 0.52 にとどまります。これは隣接する仮名や複数の文字を 1 つの返り点として取り込みすぎる過剰取り込みが主因で、引き続きの改善対象です。

所蔵機関ごとの精度を見ると、字体・版面の難易度や撮影条件の違いがそのまま現れます。テスト集合に占める各ホスト（IIIF の画像配信元、便宜上「ホスト」と呼びます）の件数と本文の plain micro CER は次のとおりで、所蔵機関ごとの難易度の幅が現れます（n ≥ 300 のホスト。v16 の test で実測）。

所蔵元（IIIF ホスト）	件数	CER
福井県デジタルアーカイブ`www.digital-archives.pref.fukui.lg.jp`	6,656	0.046
ADEAC`dcfs.trc-adeac.co.jp`	1,809	0.062
東京大学附属図書館`iiif.dl.itc.u-tokyo.ac.jp`	8,376	0.068
京都大学附属図書館`rmda.kulib.kyoto-u.ac.jp`	1,190	0.068
国立国会図書館`dl.ndl.go.jp`	20,367	0.070
琉球大学附属図書館`shimuchi.lib.u-ryukyu.ac.jp`	876	0.074
amane project`ourarchives.amane-project.jp`	5,947	0.080
個人/小規模配信サーバ`os3-373-19774.vs.sakura.ne.jp`	640	0.084
九州大学附属図書館`catalog.lib.kyushu-u.ac.jp`	1,039	0.100
東京学芸大学附属図書館`d-archive.u-gakugei.ac.jp`	1,071	0.102
国立歴史民俗博物館`khirin-a.rekihaku.ac.jp`	3,214	0.103
国文学研究資料館古典籍`kotenseki.nijl.ac.jp`	1,050	0.120
国立国会図書館（旧系統）`www.dl.ndl.go.jp`	3,590	0.125
国文学研究資料館`kokusho.nijl.ac.jp`	1,981	0.129

行数の多いホストでは、最大件数である国立国会図書館で 0.070、東京大学附属図書館で 0.068、福井県デジタルアーカイブでは 0.046 と良好です。一方、国文学研究資料館（古典籍・通常系とも）は 0.12 前後と差があり、字体や版面の難しさの幅がはっきり現れています。全体（59,072 行）の plain micro CER は 0.080 です。

推論時の工夫

ブラウザ上での処理は 2 段階です。まず入力画像から本文の行領域を検出するレイアウト認識を行い、検出した各行について行 OCR（enc-dec 認識）を実行します。レイアウト認識モデルには、NDL古典籍OCR-Lite の RTMDet-s モデルを使用させていただいています（本システム独自の YOLOv8 モデルへ設定で切替も可能）。検出した行 bbox の読み順は、テキストを用いない XY-Cut によりレイアウト認識の段階で確定させます。

学習済みモデルは ONNX 形式に変換し、quantize_dynamic による動的 int8 量子化で軽量化しました。エンコーダは単一ファイル（約 89 MB）として書き出した一方、デコーダは推論時の高速化のため 2 ファイルに分割して書き出しています：初回の <CLS> を流し込み内部状態（Key/Value キャッシュ）を構築するための prefill（約 32 MB）と、以降の 1 トークン生成ごとに既存キャッシュへ次のトークンを追記する step（約 28 MB）です。これによりデコーダ内のセルフアテンション計算が 「これまで生成した全トークンの再計算」から「直前 1 トークンの追加計算」へ短縮され、greedy 推論は理論上 5–10 倍高速化します。これらをブラウザの WebAssembly 実行環境（onnxruntime-web）で動作させています。

推論時には、いくつかの後処理を組み合わせています。第一に、撮影で生じる行の微小な傾きを投影プロファイル法で行ごとに推定し、補正してから認識に渡す処理。第二に、生成中の系列が同一トークンや短周期パターンで反復し始めた場合（行末崩壊）に打ち切る安全機構です。

今後の課題

撮影解像度が低い資料、字形の崩れが極端な資料、また割書のように 1 行に二段組で書かれた要素には、現状でも誤認識が残ります。割書の絶対水準はなお低く、さらなる解像度向上と対象資料に合わせた追加学習を予定しています。返り点については、隣接する仮名や複数文字を 1 つの返り点として取り込みすぎる過剰取り込みの抑制が次の改善対象です。加えて、推論段での文法制約デコード（タグの整合性確保）や言語モデルとの統合（視覚的に紛らわしい文字の補正）も検討しています。なお前述のとおり、現状の CER は校閲前の翻刻文を正解として測定したものであり、専門家による校閲を経た厳密なテストセットでの再評価を進める予定です。

Technical Details

Written June 22, 2026 (v16 release)

Minna de Honkoku OCR is an OCR system that takes an image of a classical Japanese book written in cursive script (kuzushiji) and generates, in a single pass, both the body-text transcription and its annotation structure: furigana (ruby readings), okurigana and kaeriten (reading marks for kanbun, i.e. Chinese-text passages read in Japanese), and warigaki (interlinear two-column notes). The output is not a plain string but structured text whose tags preserve these annotation relations, which makes downstream typesetting, search, and reuse straightforward. The sections below cover the dataset, preprocessing, model architecture, training, evaluation, and inference-time techniques in turn.

Dataset Provenance

Training data is sourced from Minna de Honkoku, a citizen-driven transcription platform launched in 2017 and jointly operated by the National Museum of Japanese History, the Earthquake Research Institute of the University of Tokyo, and the Kyoto University Historical Earthquakes Research Group. The platform hosts classical-book images from the IIIF digital archives of multiple holding institutions, including the National Diet Library, National Institute of Japanese Literature, University of Tokyo Library, Kyoto University Library, National Museum of Japanese History, Fukui Prefectural Digital Archives, and University of the Ryukyus Library. Volunteer transcribers attach transcription and annotations (furigana, kaeriten, etc.) to those images.

From the portion of these transcriptions whose license terms could be confirmed, we paired per-line position information (coordinates on the IIIF Image API) with transcription text and extracted approximately 1.2 million line-image / transcription pairs, organized into our custom webdataset_v3 format. Train/validation/test splits are assigned deterministically from a hash of the book identifier (entryId), so lines from the same book never appear on both the training and test sides, and the reported numbers measure generalization to unseen books.
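The book-level split can be illustrated with a short sketch. The hash function (SHA-1), the bucket count, and the 5% / 5% validation and test ratios below are assumptions chosen for illustration; the source only states that the split is a deterministic function of a hash of entryId.

```python
import hashlib

def assign_split(entry_id: str, val_pct: int = 5, test_pct: int = 5) -> str:
    """Map a book identifier to 'train', 'val' or 'test' deterministically.

    val_pct and test_pct are hypothetical ratios, not the project's values.
    """
    # SHA-1 is stable across runs and machines, unlike Python's built-in hash().
    digest = hashlib.sha1(entry_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # integer in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

# Every line inherits the split of its book, so a book never straddles splits.
lines = [("book-A", 1), ("book-A", 2), ("book-B", 1)]
splits = {(entry, n): assign_split(entry) for entry, n in lines}
```

Because the split depends only on the identifier, adding new books later never moves an existing book to a different split.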

This dataset is a scale-up of the Minna de Honkoku Dataset published by NDL Lab, which is also derived from Minna de Honkoku transcriptions. Following the NDL Lab dataset’s design as a foundation, we broaden the source materials and strengthen automatic line-image extraction and annotation-structure preservation (furigana, kaeriten, etc.), expanding the scale to ~1.2 million line pairs.

For automatic line-image extraction we use a page-layout / line-detection model trained on the NDL-DocL Dataset (document-image layout dataset), published by NDL Lab. Line bounding boxes detected on each page are cropped via the IIIF Image API. For aligning each cropped line image with a single line of transcription text (i.e., which bbox corresponds to which line of text), we use the parseq ONNX model from NDL Kotenseki OCR-Lite as a recognition prior: we compute the edit distance (Levenshtein distance) between each tentative parseq output and each transcription line, normalize by length, and accept only pairs whose normalized distance falls below a fixed threshold. This filters out noisy or clearly misaligned correspondences and yields a mechanically validated text–image mapping.
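A minimal sketch of this filtering step follows. It assumes the tentative parseq outputs are already available as strings; the threshold value (0.5), the choice of the longer string as the normalizer, and the greedy best-match policy are illustrative assumptions, since the source does not specify them.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def normalized_distance(hyp: str, ref: str) -> float:
    """Edit distance divided by the longer length (assumed normalizer)."""
    return levenshtein(hyp, ref) / max(len(hyp), len(ref), 1)

def align(ocr_lines, transcript_lines, threshold=0.5):
    """Return (ocr_index, transcript_index, distance) for accepted pairs.

    Each detected line is matched to its closest transcription line and kept
    only if the normalized distance is below the (hypothetical) threshold.
    """
    pairs = []
    for i, hyp in enumerate(ocr_lines):
        scored = [(normalized_distance(hyp, ref), j)
                  for j, ref in enumerate(transcript_lines)]
        if not scored:
            continue
        dist, j = min(scored)
        if dist < threshold:
            pairs.append((i, j, dist))
    return pairs

accepted = align(["abcde", "zzzzz"], ["abcdf", "hello"])
```

In this toy input the first detected line is accepted with one substitution out of five characters, while the second has no plausible partner and is dropped.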

The constructed dataset will be released under an open license in the near future, in a form usable for both academic and commercial purposes. Details will be announced in this repository.

Data Preprocessing

Transcriptions use a notation specific to Minna de Honkoku. For example, “漢字（かんじ）” or “《振り仮名：漢字｜かんじ》” encode furigana; “￣ニ” or “［ニ］” encode okurigana; “＿レ” or “｛レ｝” encode kaeriten; and “《割書：右｜左》” encodes warigaki. Before training, all of these are normalized into special training tokens: <ruby>, <rt>, <OKURI>, <KAERI>, <WARI>, etc. To absorb kana-orthographic variation across transcribers (the same particle may be written “ニ” in one source and “に” in another), we fold isolated single-character katakana into hiragana. In contrast, classical-form (kyujitai) and variant characters are preserved faithfully rather than unified to modern (shinjitai) forms; the previous generation applied a 464-character kyujitai→shinjitai mapping, which is dropped in this version. This lets the model reproduce old and variant glyphs exactly as they appear in the source, and the vocabulary is expanded to cover them. Finally, double ruby (a second furigana on the left side of the kanji, encoded as <rt2>) occurs very rarely in the gold transcriptions and was a structure the model tended to over-produce, so in the current version it is removed at preprocessing time and excluded from the vocabulary.
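The two normalization steps might look roughly like the sketch below. It covers only the explicit bracketed forms; the parenthesized furigana form, the prefix forms (“￣ニ”, “＿レ”) and warigaki are omitted. The closing tokens (</rt>, </ruby>, </OKURI>, </KAERI>) and the exact definition of "isolated" (no katakana or long-vowel mark on either side) are assumptions, since the source lists only the opening token names.

```python
import re

def normalize_notation(text: str) -> str:
    """Rewrite a subset of the Minna de Honkoku notation as training tokens.

    The serialization with closing tokens is hypothetical.
    """
    text = re.sub(r"《振り仮名：([^｜》]+)｜([^》]+)》",
                  r"<ruby>\1<rt>\2</rt></ruby>", text)
    text = re.sub(r"［([^］]+)］", r"<OKURI>\1</OKURI>", text)
    text = re.sub(r"｛([^｝]+)｝", r"<KAERI>\1</KAERI>", text)
    return text

# A katakana letter with no katakana (or long-vowel mark) on either side.
_ISOLATED_KATAKANA = re.compile(r"(?<![ァ-ヶー])[ァ-ヶ](?![ァ-ヶー])")

def fold_isolated_katakana(text: str) -> str:
    """Convert isolated single katakana to hiragana; leave katakana words alone."""
    # Hiragana and katakana blocks are offset by 0x60 code points.
    return _ISOLATED_KATAKANA.sub(lambda m: chr(ord(m.group()) - 0x60), text)

tagged = normalize_notation("《振り仮名：漢字｜かんじ》有｛レ｝朋")
folded = fold_isolated_katakana("山ニ登ル")
```

The two functions are shown separately on purpose: whether folding is also applied inside annotation spans is not stated in the source, so the sketch does not chain them.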

Model Architecture

The line recognizer is a Vision-Encoder-Decoder neural network that generates structured text from images, implemented on top of HuggingFace’s VisionEncoderDecoderModel framework.

The encoder is ConvNeXt V2, a convolutional network widely used for image recognition, at the Base scale (~88M parameters). ConvNeXt V2 adds self-supervised pretraining via a fully convolutional masked autoencoder (FCMAE) and a Global Response Normalization (GRN) layer that normalizes responses across channels; we start from weights fine-tuned on ImageNet-22k at 384px. Each vertical line image is rotated 90° into a horizontal orientation and resized, preserving its aspect ratio, to height 256px and width up to 2048px (an 8:1 canvas). With a cumulative stride of 32, the final stage yields an 8×64 feature map, i.e. 512 vectors of 1024 dimensions. On top of this we add a custom learned 2D positional embedding that tells the model which row and column of the line each feature comes from: 8 row embeddings and 72 column embeddings are provided, the matching row and column embeddings are added to each cell, and LayerNorm is applied. Because the vertical line has been rotated, the horizontal axis corresponds to the reading direction. Injecting position explicitly on the encoder side helps the decoder's cross-attention scan stably to the end of long lines.
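The shape arithmetic above can be checked with a small sketch. It assumes that a resized line is placed on a fixed 256×2048 canvas (so the feature map is always 8×64) and that lines whose resized width would exceed 2048px are simply clamped; the source does not say how over-long lines or padding are handled.

```python
import math

STRIDE = 32      # cumulative stride of the encoder
TARGET_H = 256   # input height after rotation and resize
MAX_W = 2048     # maximum input width

def encoder_geometry(src_w: int, src_h: int):
    """Return (content_width, rows, cols, content_cols) for a vertical line crop.

    src_w and src_h are the crop's size before the 90-degree rotation.
    """
    rot_w, rot_h = src_h, src_w                 # rotation swaps width and height
    scale = TARGET_H / rot_h                    # aspect-preserving scale factor
    content_w = min(MAX_W, max(1, round(rot_w * scale)))
    rows = TARGET_H // STRIDE                   # 256 / 32 = 8
    cols = MAX_W // STRIDE                      # 2048 / 32 = 64
    content_cols = math.ceil(content_w / STRIDE)  # columns that contain ink
    return content_w, rows, cols, content_cols

def position_ids(rows: int, cols: int):
    """(row, column) index of every cell in row-major (flattened) order."""
    return [(r, c) for r in range(rows) for c in range(cols)]

long_line = encoder_geometry(100, 1000)   # (2048, 8, 64, 64)
short_line = encoder_geometry(128, 640)   # (1280, 8, 64, 40)
ids = position_ids(8, 64)
```

Each (row, column) pair would index the 8-entry row table and the 72-entry column table; the largest column index used here is 63, which fits within the 72 column embeddings.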

The decoder is a small RoBERTa-based Transformer (6 layers, hidden size 512, 8 heads). It is initialized from a RoBERTa that we pretrained with a masked language modeling (MLM) objective on a corpus of about 57 million characters assembled from Minna de Honkoku transcriptions, so the decoder starts from a language model already adapted to the target domain. When this model is transferred to the OCR task, the cross-attention layers are newly initialized and trained. Only the encoder and decoder start from their respective pretrained weights; in this version the OCR model as a whole is trained from scratch and does not inherit from any earlier OCR checkpoint. Generation is autoregressive, predicting one token at a time from the start token <CLS> to the end token <SEP>. The tokenizer is character-level, with a vocabulary of the 7,710 most frequent characters in the corpus (up from 5,000 in the previous generation, expanded to match the faithful-preservation policy for classical and variant glyphs). Eleven structural special tokens for furigana, kaeriten, okurigana, and warigaki share this vocabulary with the body-text characters, so character recognition and annotation tagging happen jointly in a single generation pass.

Figure: Model architecture of Minna de Honkoku OCR

Training

Optimization uses AdamW with two learning-rate groups: the newly initialized modules (2D positional embeddings, the encoder/decoder projection layer, and cross-attention) at a high rate of 1×10⁻³, and the encoder backbone and the pretrained decoder body at a low rate of 5×10⁻⁵. Training runs in bfloat16 autocast with effective batch size 64 for roughly 5 epochs on a single NVIDIA A100. The 256×2048 input is pixel-heavy and memory-intensive, so we shrink the micro-batch and increase gradient accumulation to keep the effective batch size unchanged. The learning rate is annealed linearly in the final segment, and the validation CER drops a further notch during that anneal.
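The two-group setup and the accumulation arithmetic can be sketched as follows. The substrings used to recognize newly initialized modules are hypothetical names, not the project's actual parameter names; the returned list follows the per-parameter-group format that torch.optim.AdamW accepts.

```python
HIGH_LR = 1e-3   # newly initialized modules
LOW_LR = 5e-5    # pretrained encoder backbone and decoder body

# Hypothetical name fragments identifying the newly initialized modules.
NEW_MODULE_KEYS = ("pos_embed_2d", "enc_to_dec_proj", "crossattention")

def build_param_groups(named_params):
    """Split (name, parameter) pairs into a high-LR and a low-LR group."""
    new, pretrained = [], []
    for name, param in named_params:
        is_new = any(key in name for key in NEW_MODULE_KEYS)
        (new if is_new else pretrained).append(param)
    return [{"params": new, "lr": HIGH_LR},
            {"params": pretrained, "lr": LOW_LR}]

def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Number of micro-batches to accumulate per optimizer step."""
    if effective_batch % micro_batch:
        raise ValueError("micro_batch must divide effective_batch")
    return effective_batch // micro_batch

# With PyTorch installed, the groups could be passed straight to the optimizer:
#   optimizer = torch.optim.AdamW(build_param_groups(model.named_parameters()))
steps = accumulation_steps(64, 8)   # micro-batch of 8 is an example value
```

Shrinking the micro-batch from 8 to 4 would double the accumulation steps from 8 to 16 while leaving the effective batch size at 64.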

The loss is a weighted cross-entropy with label smoothing 0.1, in which structural tokens carry a weight of 2.0. On top of this, a focal-style regularizer (coefficient 0.5) penalizes <SEP> probability at positions where the target is not <SEP>, with a larger penalty the higher that probability, to counter a failure mode in which <SEP> fires early and truncates the line. Because kaeriten and okurigana are rare and would otherwise provide a weak training signal, lines containing them are each oversampled 2.0× in the training stream. Augmentation via albumentations randomly applies elastic distortion, morphological operations (stroke-width variation), Gaussian noise, resolution reduction, and JPEG compression artifacts, aiming for robustness to differences in imaging conditions and paper condition.
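To make the pieces of the loss concrete, here is a per-position sketch in plain Python. The exact form of the focal-style term is not given in the source; the version below, p_sep^gamma × (−log(1 − p_sep)) with gamma = 2, is one plausible reading and should be treated as an assumption, as should the simplified label-smoothing formula.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_loss(logits, target, sep_id, structural_ids,
               smoothing=0.1, struct_weight=2.0, sep_coeff=0.5, gamma=2.0):
    """Loss for one decoding position (gamma is a hypothetical exponent)."""
    log_probs = log_softmax(logits)
    nll = -log_probs[target]
    uniform = -sum(log_probs) / len(log_probs)      # smoothing target: uniform
    ce = (1.0 - smoothing) * nll + smoothing * uniform
    weight = struct_weight if target in structural_ids else 1.0
    loss = weight * ce
    if target != sep_id:
        # Penalize probability mass on <SEP> where the line should continue.
        p_sep = math.exp(log_probs[sep_id])
        loss += sep_coeff * (p_sep ** gamma) * -math.log(max(1.0 - p_sep, 1e-12))
    return loss

# Toy vocabulary of 4 tokens: id 2 is structural, id 3 is <SEP>.
plain = token_loss([0.0, 0.0, 0.0, 0.0], target=1, sep_id=3, structural_ids={2})
structural = token_loss([0.0, 0.0, 0.0, 0.0], target=2, sep_id=3, structural_ids={2})
early_sep = token_loss([0.0, 0.0, 0.0, 3.0], target=1, sep_id=3, structural_ids={2})
```

With identical logits the structural target costs more than the plain one, and raising the <SEP> logit at a non-terminal position raises the loss further through both the cross-entropy and the extra penalty.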

Evaluation

On a book-level test split of 59,072 lines, with kana variation normalized, plain text CER is approximately 0.080 (~92.0% character-level accuracy). The body-text accuracy holds at this level even though the vocabulary was expanded to 7,710 characters to faithfully preserve classical and variant glyphs. For annotation tagging, F1 reaches 0.73 for the strict “exact (kanji, reading) pair” furigana match, 0.71 for okurigana, 0.52 for kaeriten (all spans, including densely-marked kanbun), and 0.40 for warigaki.
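For reference, "micro" CER pools errors and reference characters over all lines before dividing, which differs from averaging per-line rates. The sketch below assumes per-line edit distances and reference lengths have already been computed; the numbers in the example are made up.

```python
def micro_cer(stats):
    """stats: list of (edit_distance, reference_length), one pair per line."""
    total_errors = sum(e for e, _ in stats)
    total_chars = sum(n for _, n in stats)
    return total_errors / total_chars if total_chars else 0.0

def macro_cer(stats):
    """Unweighted mean of per-line CER, shown only for contrast."""
    rates = [e / n for e, n in stats if n]
    return sum(rates) / len(rates) if rates else 0.0

# One short line with 1 error in 2 characters, one long line with 1 in 48.
example = [(1, 2), (1, 48)]
micro = micro_cer(example)   # 2 errors / 50 characters
macro = macro_cer(example)   # dominated by the short line

# Character-level accuracy quoted alongside CER is simply 1 - CER.
accuracy = 1.0 - 0.080
```

Micro averaging keeps very short lines from dominating the score, which matters when line lengths vary widely.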

About the CER figures (important). The CER reported here is measured against the volunteer transcriptions of Minna de Honkoku as ground truth. However, those transcriptions have not yet undergone expert review: a separate sampling survey over ~100,000 characters estimated the transcriptions’ own accuracy at about 98.5%. The numbers above are therefore agreement rates against an imperfect reference—errors in the transcription can be counted as model errors and vice versa—and do not strictly equal the model’s true recognition accuracy. We plan a more rigorous CER measurement using an expert-reviewed test set in the future.

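Methodologically, we found that structural tokens are better evaluated with an F1 over multiset matches of items than with region CER, which concatenates the strings inside each region and takes an edit distance. Region CER behaves misleadingly for structure evaluation: on lines where kanbun kaeriten are densely packed, a single misplacement can distort the whole metric. F1 is order-invariant and robust to outliers, and gives a more realistic picture.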
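A minimal sketch of multiset-match precision, recall, and F1 is shown below, using (kanji, reading) pairs as the items. How items are extracted from tagged text and how counts are pooled across lines are not specified in the source; here all items are simply pooled into one list.

```python
from collections import Counter

def multiset_prf(gold_items, pred_items):
    """Precision, recall and F1 where duplicates count and order is ignored."""
    gold, pred = Counter(gold_items), Counter(pred_items)
    true_pos = sum((gold & pred).values())   # multiset intersection
    precision = true_pos / sum(pred.values()) if pred else 0.0
    recall = true_pos / sum(gold.values()) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("漢字", "かんじ"), ("山", "やま"), ("山", "やま")]
pred = [("山", "やま"), ("漢字", "かんし")]   # one correct pair, one wrong reading
p, r, f1 = multiset_prf(gold, pred)
```

Here one of two predictions is correct (precision 0.5) and one of three gold items is recovered (recall 1/3), and shuffling either list leaves the scores unchanged.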

Per-structure Precision / Recall / F1 (multiset match; kana-fold normalized, spurious <rt2> stripped = deployed configuration; measured on the v16 test split of 59,072 lines):

Structure	Unit definition	gold	P	R	F1
Furigana	exact (kanji, reading) pair match	29,279	0.764	0.697	0.729
Okurigana	exact `<OKURI>` span string match	4,474	0.685	0.726	0.705
Kaeriten	exact `<KAERI>` span string match (all spans)	11,455	0.535	0.499	0.516
Warigaki	exact span match (right + separator + left)	2,203	0.453	0.356	0.399

Furigana keeps F1 0.73 even under the strict “which kanji gets which reading” exact match, and okurigana reaches F1 0.71 (improved over the previous generation). Warigaki F1 is relatively low because its unit is a long two-column string and any single-character mismatch fails the exact span match. Kaeriten sits at F1 0.52 over all spans, because Minna de Honkoku contains many lines of densely-marked Chinese (kanbun) text; the model tends to over-capture adjacent kana or multiple characters into a single return mark—a continuing improvement target.

Per-host plain micro CER (hosts with n ≥ 300; measured on the v16 test split), showing the range of per-institution difficulty:

Holding institution (IIIF host)	n	CER
Fukui Prefectural Digital Archives `www.digital-archives.pref.fukui.lg.jp`	6,656	0.046
ADEAC `dcfs.trc-adeac.co.jp`	1,809	0.062
University of Tokyo Library `iiif.dl.itc.u-tokyo.ac.jp`	8,376	0.068
Kyoto University Library `rmda.kulib.kyoto-u.ac.jp`	1,190	0.068
National Diet Library `dl.ndl.go.jp`	20,367	0.070
University of the Ryukyus Library `shimuchi.lib.u-ryukyu.ac.jp`	876	0.074
amane project `ourarchives.amane-project.jp`	5,947	0.080
Individual / small-scale server `os3-373-19774.vs.sakura.ne.jp`	640	0.084
Kyushu University Library `catalog.lib.kyushu-u.ac.jp`	1,039	0.100
Tokyo Gakugei University Library `d-archive.u-gakugei.ac.jp`	1,071	0.102
National Museum of Japanese History `khirin-a.rekihaku.ac.jp`	3,214	0.103
NIJL Classical Books `kotenseki.nijl.ac.jp`	1,050	0.120
National Diet Library (legacy) `www.dl.ndl.go.jp`	3,590	0.125
National Institute of Japanese Literature `kokusho.nijl.ac.jp`	1,981	0.129

Among higher-volume hosts, the National Diet Library (the largest sample) reaches 0.070, the University of Tokyo Library 0.068, and the Fukui Prefectural Digital Archives 0.046. The two National Institute of Japanese Literature hosts (kotenseki and kokusho) are markedly harder at 0.120 and 0.129, which clearly reflects the range of script and page-layout difficulty. The overall plain micro CER across all 59,072 lines is 0.080.

Inference

Processing in the browser is a two-stage pipeline: first layout recognition detects the body-text line regions in the input image, then line OCR (enc-dec recognition) runs on each detected line. For layout recognition we use the RTMDet-s model from NDL Kotenseki OCR-Lite (switchable in settings to this system’s own YOLOv8 model). The reading order of the detected line boxes is fixed during the layout stage by a text-free XY-Cut.
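A text-free XY-Cut can be sketched as a recursive split on projection gaps. The version below assumes vertical Japanese layout (tiers read top to bottom, columns within a tier read right to left) and a hypothetical minimum gap of 1 pixel; the source does not describe which XY-Cut variant or parameters are used.

```python
def _split(boxes, lo, hi, min_gap):
    """Group boxes whose [lo, hi] extents overlap along one axis."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups, reach = [[ordered[0]]], ordered[0][hi]
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])        # empty band found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups

def xy_cut(boxes, min_gap=1):
    """Order (x0, y0, x1, y1) line boxes for vertical, right-to-left text."""
    if len(boxes) <= 1:
        return list(boxes)
    tiers = _split(boxes, 1, 3, min_gap)        # horizontal cuts (along y)
    if len(tiers) > 1:
        return [b for tier in tiers for b in xy_cut(tier, min_gap)]
    columns = _split(boxes, 0, 2, min_gap)      # vertical cuts (along x)
    if len(columns) > 1:
        return [b for col in reversed(columns) for b in xy_cut(col, min_gap)]
    # No clean cut exists: fall back to right-to-left, then top-to-bottom.
    return sorted(boxes, key=lambda b: (-b[2], b[1]))

a = (300, 0, 340, 400)    # rightmost column
b = (200, 0, 240, 400)    # middle column
c = (100, 0, 140, 180)    # left column, upper part
d = (100, 220, 140, 400)  # left column, lower part
order = xy_cut([d, b, a, c])
```

Because only box geometry is used, the reading order is fixed before any text has been recognized, matching the pipeline described above.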

The trained model is exported to ONNX and compressed via quantize_dynamic (dynamic int8 quantization). The encoder is shipped as a single file (~89 MB), while for inference speed the decoder is split into two files: a prefill graph (~32 MB) that runs once on <CLS> and constructs the internal Key/Value cache, and a step graph (~28 MB) that consumes the cache and a single new token at each iteration. This means decoder self-attention shifts from “recompute over all tokens generated so far” to “extend by one new token,” giving a theoretical 5–10× speedup on greedy decoding. The graphs are executed in the browser’s WebAssembly environment using onnxruntime-web.
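The prefill/step split corresponds to a decoding loop like the sketch below. The two callables stand in for the exported ONNX graphs; their signatures, the cache representation, and the toy model are assumptions made for illustration, and the deployed system runs in the browser rather than in Python.

```python
def greedy_decode(prefill, step, encoder_states, cls_id, sep_id, max_len=256):
    """Greedy decoding with a key/value cache.

    prefill(encoder_states, cls_id) -> (logits, cache)   # run once
    step(token_id, cache)           -> (logits, cache)   # run per new token
    """
    logits, cache = prefill(encoder_states, cls_id)
    tokens = []
    for _ in range(max_len):
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == sep_id:
            break
        tokens.append(next_id)
        # Only the new token is processed; earlier keys/values come from cache.
        logits, cache = step(next_id, cache)
    return tokens

# Toy stand-ins: the "cache" is just the token history, vocabulary size 6.
def toy_logits(cache):
    target = len(cache) + 1 if len(cache) < 4 else 1   # emit 2, 3, 4, then <SEP>
    return [1.0 if i == target else 0.0 for i in range(6)]

def toy_prefill(encoder_states, cls_id):
    cache = [cls_id]
    return toy_logits(cache), cache

def toy_step(token_id, cache):
    cache = cache + [token_id]
    return toy_logits(cache), cache

decoded = greedy_decode(toy_prefill, toy_step, None, cls_id=0, sep_id=1)
```

Without the cache, each iteration would re-run the decoder over the whole prefix, so the per-token cost would grow with the length of the line.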

Two auxiliary steps run at inference time. First, before recognition, the small tilt that photography introduces is estimated per line with a projection-profile method and corrected. Second, a safety guard terminates generation if the output begins repeating the same token or a short cyclic pattern (end-of-line collapse).
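The repetition guard could be implemented along these lines. The maximum period (4) and the number of repeats that triggers termination (4) are hypothetical values; the source does not give the actual thresholds.

```python
def is_collapsing(tokens, max_period=4, min_repeats=4):
    """True if the sequence ends in a short pattern repeated min_repeats times."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            continue
        tail = tokens[-span:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(span)):
            return True
    return False

same_token = is_collapsing([1, 2, 3, 7, 7, 7, 7])            # period 1
short_cycle = is_collapsing([5, 1, 2, 1, 2, 1, 2, 1, 2])     # period 2
healthy = is_collapsing([1, 2, 3, 4, 5, 6, 7, 8])
```

Such a check would be called after each generated token, with generation stopped as soon as it returns True. Thresholds need care because legitimate text can contain short runs of repeated characters.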

Future Work

Low-resolution images, severely deformed scripts, and elements written as two columns within a single line (warigaki) remain failure modes. The absolute level for warigaki is still low; we plan further resolution increases and targeted additional training on specific source materials. For kaeriten, the next target is suppressing over-capture—the tendency to absorb adjacent kana or multiple characters into a single return mark. We are also exploring grammar-constrained decoding at inference time (to enforce tag consistency) and integration with a language model (to correct visually confusable characters). As noted above, the current CER is measured against unreviewed transcriptions as ground truth; we plan to re-evaluate on a rigorous test set that has undergone expert review.