[NLP] Perplexity, BLEU, and ROUGE: Theory and Implementation

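Evaluating a language model or text generation system requires appropriate metrics. Perplexity, BLEU, and ROUGE are among the most widely used, and each measures a different aspect of quality.

This article explains the theory behind these metrics and how to implement them.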

Contents

Theory and computation of Perplexity
How the BLEU score works
ROUGE variants and when to use each
Other evaluation metrics
Implementation in Python

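Perplexity

Theory

Perplexity is the most basic metric of language model quality. It expresses how predictable the model finds the test data: the higher the probability the model assigns to the test sequence, the lower the perplexity.

Definition

For a test sequence $W = (w_1, w_2, \ldots, w_N)$, Perplexity is defined as: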

$$ \text{PPL}(W) = P(w_1, w_2, \ldots, w_N)^{-1/N} $$

Factoring the joint probability with the chain rule and taking logarithms gives:

$$ \text{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right) $$
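As a quick check that the two forms of the definition agree, here is a minimal sketch in plain Python. The per-token probabilities are made-up values chosen for illustration, and `perplexity_from_probs` is a hypothetical helper name, not part of any library.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity from the conditional probability of each token in a sequence."""
    n = len(token_probs)
    # Average negative log-likelihood per token (natural log)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# Hypothetical values of P(w_i | w_1, ..., w_{i-1}) for a 4-token sequence
probs = [0.5, 0.25, 0.5, 0.25]

ppl_log_form = perplexity_from_probs(probs)
# Direct form: joint probability raised to the power -1/N
ppl_direct = math.prod(probs) ** (-1 / len(probs))

print(f"log form: {ppl_log_form:.4f}, direct form: {ppl_direct:.4f}")
```

In practice the log form is preferred, because the product of many small probabilities underflows for long sequences.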

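Intuitive interpretation

Perplexity can be read as the average number of equally likely choices the model is effectively deciding among when predicting the next word.

Perplexity	Interpretation
1	Perfect prediction (always certain and correct)
10	As hard as choosing among 10 options on average
100	As hard as choosing among 100 options on average
Vocabulary size	Equivalent to uniform random prediction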
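The last row of the table can be verified numerically. The sketch below assumes a hypothetical model that assigns probability 1/V to every token regardless of context; `uniform_model_perplexity` is an illustrative name, and 50257 is used only as an example vocabulary size.

```python
import math

def uniform_model_perplexity(vocab_size, seq_len):
    """Perplexity of a model that predicts every token with probability 1/V."""
    p = 1.0 / vocab_size
    avg_nll = -sum(math.log(p) for _ in range(seq_len)) / seq_len
    return math.exp(avg_nll)

for v in (1, 10, 100, 50257):
    print(v, round(uniform_model_perplexity(v, seq_len=20), 4))
```

The result does not depend on the sequence length, since every token contributes the same log-probability.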

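Relation to information theory

Perplexity is related to the cross-entropy $H$ as follows:

$$ \text{PPL} = 2^{H(P, Q)} $$

where the cross-entropy is measured in bits (base-2 logarithm):

$$ H(P, Q) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \ldots, w_{i-1}) $$

If the cross-entropy is instead computed with the natural logarithm (in nats), as in the definition above, the same Perplexity is obtained as $\exp(H)$.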
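The choice of logarithm base does not change the Perplexity, as long as the base of the exponentiation matches. A small sketch with made-up token probabilities:

```python
import math

# Hypothetical conditional probabilities assigned to each token of a sequence
probs = [0.5, 0.25, 0.125, 0.125]
n = len(probs)

h_bits = -sum(math.log2(p) for p in probs) / n   # cross-entropy in bits
h_nats = -sum(math.log(p) for p in probs) / n    # cross-entropy in nats

ppl_from_bits = 2 ** h_bits
ppl_from_nats = math.exp(h_nats)

print(f"H = {h_bits:.4f} bits = {h_nats:.4f} nats")
print(f"PPL = {ppl_from_bits:.4f} (from bits), {ppl_from_nats:.4f} (from nats)")
```

This matters when reading losses from a framework: PyTorch's cross-entropy loss uses the natural logarithm, so the matching conversion is `exp(loss)`, not `2 ** loss`.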

Implementation in Python

import torch
import torch.nn.functional as F
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

def calculate_perplexity(model, tokenizer, text, device='cpu'):
    """
    Compute the Perplexity of a single text.

    Parameters:
    -----------
    model : transformers model
        Causal language model
    tokenizer : transformers tokenizer
        Tokenizer matching the model
    text : str
        Text to evaluate
    device : str
        Device to run on

    Returns:
    --------
    perplexity : float
        Perplexity value
    """
    model.eval()
    model.to(device)

    encodings = tokenizer(text, return_tensors='pt')
    input_ids = encodings.input_ids.to(device)

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss

    perplexity = torch.exp(loss).item()
    return perplexity


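def calculate_perplexity_batch(model, tokenizer, texts, device='cpu', batch_size=8):
    """Compute corpus-level Perplexity over a list of texts in batches."""
    model.eval()
    model.to(device)

    # Tokenizers such as GPT-2's define no pad token; reuse EOS for padding
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    total_loss = 0.0
    total_tokens = 0

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        encodings = tokenizer(
            batch_texts,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=512
        )

        input_ids = encodings.input_ids.to(device)
        attention_mask = encodings.attention_mask.to(device)

        # Set padded positions to -100 so the loss ignores them
        labels = input_ids.masked_fill(attention_mask == 0, -100)

        with torch.no_grad():
            outputs = model(
                input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

        # outputs.loss is the mean over scored tokens. Labels are shifted by one
        # inside the model, so the first token of each text is never scored.
        n_tokens = (labels[:, 1:] != -100).sum().item()
        total_loss += outputs.loss.item() * n_tokens
        total_tokens += n_tokens

    avg_loss = total_loss / total_tokens
    perplexity = np.exp(avg_loss)

    return perplexity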


# Usage example
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "The quick brown fox jumps over the lazy dog."
ppl = calculate_perplexity(model, tokenizer, text)
print(f"Perplexity: {ppl:.2f}")

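BLEU (Bilingual Evaluation Understudy)

Theory

BLEU was developed as an evaluation metric for machine translation, but it is used for text generation in general. It measures the n-gram overlap between the generated text and one or more reference texts.

Modified n-gram Precision

BLEU computes n-gram precision, modified so that the same n-gram cannot be over-counted: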

$$ p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}(\text{n-gram})} $$

Here, $\text{Count}_{\text{clip}}$ is the candidate's count of an n-gram, clipped to the maximum number of times that n-gram appears in any single reference.
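The effect of clipping is easiest to see on a degenerate candidate that repeats one common word. The following is a minimal unigram-only sketch; `clipped_unigram_precision` is a hypothetical helper written for this illustration.

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Return (clipped match count, candidate length) for unigrams."""
    cand_counts = Counter(candidate)
    # For each token, the maximum count observed in any single reference
    max_ref = Counter()
    for ref in references:
        for tok, cnt in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], cnt)
    clipped = sum(min(cnt, max_ref[tok]) for tok, cnt in cand_counts.items())
    return clipped, len(candidate)

candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

clipped, total = clipped_unigram_precision(candidate, references)
# Naive precision: every candidate token that appears anywhere in a reference counts
unclipped = sum(1 for tok in candidate if any(tok in ref for ref in references))

print(f"unclipped: {unclipped}/{total}, clipped: {clipped}/{total}")
```

Without clipping, this candidate would get a perfect unigram precision even though it is clearly a bad output. With clipping, "the" is credited at most twice, the maximum count in any one reference.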

Brevity Penalty

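A penalty for generated text that is too short:

$$ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases} $$

Here, $c$ is the length of the generated text and $r$ is the reference length (with multiple references, the reference length closest to $c$).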
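To get a feel for how quickly the penalty grows, the sketch below evaluates BP for a few candidate lengths against a fixed reference length of 10. The lengths are arbitrary illustration values, and returning 0 for an empty candidate is a convention assumed here to avoid division by zero.

```python
import math

def brevity_penalty(c, r):
    """Brevity penalty for candidate length c and reference length r."""
    if c == 0:
        return 0.0  # assumed convention for an empty candidate
    return 1.0 if c > r else math.exp(1 - r / c)

r = 10
for c in (2, 5, 8, 10, 12):
    print(f"c={c:2d}, r={r}: BP={brevity_penalty(c, r):.4f}")
```

Candidates at least as long as the reference are not penalized by BP. Overly long output is instead punished through lower n-gram precision.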

BLEU Score

$$ \text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $$

Typically $N=4$ with uniform weights $w_n = 1/N$, so the exponential term is the geometric mean of $p_1, \ldots, p_4$.
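The following sketch combines the pieces for given precision values, to show how the geometric mean behaves. The precisions and lengths are hypothetical numbers, and `bleu_from_parts` is an illustrative helper, not a library function.

```python
import math

def bleu_from_parts(precisions, c, r):
    """BLEU from modified precisions p_1..p_N, candidate length c, reference length r."""
    if c == 0 or min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    weights = [1 / len(precisions)] * len(precisions)
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)

precisions = [0.8, 0.6, 0.4, 0.2]  # hypothetical p_1..p_4

print(f"{bleu_from_parts(precisions, c=18, r=20):.4f}")
# A single zero precision drives the whole score to zero
print(f"{bleu_from_parts([0.8, 0.6, 0.4, 0.0], c=18, r=20):.4f}")
```

Because of the geometric mean, one missing n-gram order zeroes the score. This is why sentence-level BLEU is usually computed with some form of smoothing.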

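Limitations of BLEU

Issue	Description
Ignores meaning	Synonyms and paraphrases receive no credit
Word order and structure	Being n-gram based, it captures only local word order and ignores overall sentence structure
Reference dependence	Scores depend heavily on the quality of the references
Short texts	Scores are unstable for short texts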

Implementation in Python

from collections import Counter
import numpy as np

def get_ngrams(tokens, n):
    """Extract all n-grams from a token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokens, n):
    """Count the occurrences of each n-gram."""
    ngrams = get_ngrams(tokens, n)
    return Counter(ngrams)


def modified_precision(candidate, references, n):
    """Compute the modified n-gram precision."""
    candidate_ngrams = count_ngrams(candidate, n)

    # For each n-gram, take its maximum count across the references
    max_ref_counts = Counter()
    for ref in references:
        ref_ngrams = count_ngrams(ref, n)
        for ngram, count in ref_ngrams.items():
            max_ref_counts[ngram] = max(max_ref_counts[ngram], count)

    # Clipped counts
    clipped_counts = {
        ngram: min(count, max_ref_counts[ngram])
        for ngram, count in candidate_ngrams.items()
    }

    numerator = sum(clipped_counts.values())
    denominator = sum(candidate_ngrams.values())

    if denominator == 0:
        return 0

    return numerator / denominator


def brevity_penalty(candidate, references):
    """Compute the brevity penalty."""
    c = len(candidate)

    # Pick the reference length closest to c (ties go to the shorter one)
    ref_lengths = [len(ref) for ref in references]
    r = min(ref_lengths, key=lambda x: (abs(x - c), x))

    if c > r:
        return 1
    elif c == 0:
        return 0
    else:
        return np.exp(1 - r / c)


def calculate_bleu(candidate, references, max_n=4, weights=None):
    """
    Compute the BLEU score.

    Parameters:
    -----------
    candidate : list
        List of generated tokens
    references : list of list
        List of reference token lists
    max_n : int
        Maximum n-gram order
    weights : list
        Weight for each n-gram order

    Returns:
    --------
    bleu : float
        BLEU score
    """
    if weights is None:
        weights = [1/max_n] * max_n

    # Modified Precision for each n
    precisions = []
    for n in range(1, max_n + 1):
        p = modified_precision(candidate, references, n)
        precisions.append(p)

    # Without smoothing, any zero precision makes the score zero
    if any(p == 0 for p in precisions):
        return 0

    # Weighted geometric mean, computed in log space
    log_precision = sum(w * np.log(p) for w, p in zip(weights, precisions))

    # Brevity Penalty
    bp = brevity_penalty(candidate, references)

    bleu = bp * np.exp(log_precision)
    return bleu


# Usage example
candidate = "the cat sat on the mat".split()
references = [
    "the cat is on the mat".split(),
    "a cat sat on the mat".split()
]

bleu = calculate_bleu(candidate, references, max_n=4)
print(f"BLEU Score: {bleu:.4f}")

# Using the sacrebleu library (recommended)
from sacrebleu.metrics import BLEU

bleu_scorer = BLEU()
candidate_text = "the cat sat on the mat"
reference_texts = ["the cat is on the mat", "a cat sat on the mat"]

# sentence_score takes the hypothesis string and a flat list of reference strings
result = bleu_scorer.sentence_score(candidate_text, reference_texts)
print(f"sacrebleu BLEU: {result.score:.2f}")

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Theory

ROUGE is a family of metrics developed for evaluating summarization. Unlike BLEU, it emphasizes recall.

ROUGE-N

Recall based on n-grams:

$$ \text{ROUGE-N} = \frac{\sum_{S \in \{\text{Ref}\}} \sum_{\text{n-gram} \in S} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{S \in \{\text{Ref}\}} \sum_{\text{n-gram} \in S} \text{Count}(\text{n-gram})} $$
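In this formula, the denominator counts the n-grams of the reference rather than the candidate, which is what makes the metric a recall. A minimal single-reference sketch follows; `ngram_counts` and `rouge_n_recall` are hypothetical helper names used only here.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Each reference n-gram is matched at most as often as it occurs in the candidate
    matched = sum(min(cnt, cand[ng]) for ng, cnt in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
print(f"ROUGE-1 recall: {r1:.4f}, ROUGE-2 recall: {r2:.4f}")
```

Here 5 of the 6 reference unigrams and 3 of the 5 reference bigrams are recovered by the candidate.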

ROUGE-L

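A metric based on the longest common subsequence (LCS):

$$ R_{\text{LCS}} = \frac{\text{LCS}(X, Y)}{m} $$

$$ P_{\text{LCS}} = \frac{\text{LCS}(X, Y)}{n} $$

$$ F_{\text{LCS}} = \frac{(1 + \beta^2) R_{\text{LCS}} P_{\text{LCS}}}{R_{\text{LCS}} + \beta^2 P_{\text{LCS}}} $$

Here, $X$ is the reference of length $m$, $Y$ is the generated text of length $n$, and $\beta$ is typically set to 1.2. With $\beta = 1$ the formula reduces to the usual F1 score, and $\beta > 1$ weights recall more heavily than precision.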
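A small worked example makes the role of $\beta$ concrete. The sketch below uses a space-saving two-row LCS and hypothetical helper names (`lcs_len`, `f_beta`); the sentences are chosen so that precision is perfect but recall is not.

```python
def lcs_len(x, y):
    """Length of the longest common subsequence, keeping only two DP rows."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def f_beta(precision, recall, beta):
    """F-measure as defined for ROUGE-L; beta > 1 favors recall."""
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

reference = "the cat sat on the mat".split()
candidate = "the cat on the mat".split()  # drops one reference word

lcs = lcs_len(reference, candidate)
recall = lcs / len(reference)
precision = lcs / len(candidate)

print(f"LCS={lcs}, P={precision:.4f}, R={recall:.4f}")
print(f"F (beta=1.0): {f_beta(precision, recall, 1.0):.4f}")
print(f"F (beta=1.2): {f_beta(precision, recall, 1.2):.4f}")
```

Since recall is the weaker of the two values here, raising $\beta$ above 1 pulls the F-measure down toward the recall.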

ROUGE-W

Weighted LCS. Consecutive matches are given a higher weight than matches scattered across the text.
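The article does not give the ROUGE-W formula, so the following is only a sketch of one common formulation: a dynamic program that tracks the length of the current consecutive run and scores it with a weighting function f. The choice f(k) = k^2 and the function name `weighted_lcs` are assumptions made for this illustration.

```python
def weighted_lcs(x, y, f=lambda k: k * k):
    """Weighted LCS score; f rewards longer consecutive runs (assumed f(k) = k^2)."""
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]  # best weighted score so far
    w = [[0] * (n + 1) for _ in range(m + 1)]    # length of the run ending at (i, j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                # Extending a run of length k to k + 1 adds f(k + 1) - f(k)
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                # No match: carry over the better score; the run length resets to 0
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]

x = list("ABCDEFG")
y1 = list("ABCDHIK")  # A, B, C, D matched consecutively
y2 = list("AHBKCID")  # A, B, C, D matched with gaps

print(weighted_lcs(x, y1), weighted_lcs(x, y2))
```

Both candidates share an LCS of length 4 with `x`, so plain ROUGE-L cannot tell them apart, while the weighted score prefers the one with the contiguous match.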

ROUGE-S

A metric based on skip-bigrams: any pair of words taken in sentence order, with arbitrary gaps allowed between them.
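Skip-bigrams can be enumerated directly with `itertools.combinations`, which preserves the original word order within each pair. The sketch below assumes no limit on the gap between the two words; `skip_bigrams` and `rouge_s` are hypothetical helper names.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    """Count all in-order word pairs, with any gap between the two words."""
    return Counter(combinations(tokens, 2))

def rouge_s(candidate, reference):
    """Skip-bigram precision, recall, and F1 against a single reference."""
    cand, ref = skip_bigrams(candidate), skip_bigrams(reference)
    matches = sum((cand & ref).values())  # multiset intersection
    n_cand, n_ref = sum(cand.values()), sum(ref.values())
    precision = matches / n_cand if n_cand else 0.0
    recall = matches / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {'matches': matches, 'precision': precision, 'recall': recall, 'f1': f1}

reference = "police killed the gunman".split()
candidate = "police kill the gunman".split()

result = rouge_s(candidate, reference)
print(result)
```

A 4-word sentence has 6 skip-bigrams. The two sentences share 3 of them ("police the", "police gunman", "the gunman"), even though they share only one contiguous bigram.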

Implementation in Python

from collections import Counter
import numpy as np

def lcs_length(x, y):
    """Compute the length of the longest common subsequence."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i-1] == y[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    return dp[m][n]


def rouge_n(candidate, reference, n=1):
    """
    Compute ROUGE-N (precision, recall, F1).

    Parameters:
    -----------
    candidate : list
        List of generated tokens
    reference : list
        List of reference tokens
    n : int
        The n in n-gram

    Returns:
    --------
    dict : precision, recall, f1
    """
    # count_ngrams is the helper defined in the BLEU implementation above
    candidate_ngrams = count_ngrams(candidate, n)
    reference_ngrams = count_ngrams(reference, n)

    # Count matched n-grams (each clipped to the smaller of the two counts)
    matches = sum(min(candidate_ngrams[ng], reference_ngrams[ng])
                  for ng in candidate_ngrams if ng in reference_ngrams)

    candidate_count = sum(candidate_ngrams.values())
    reference_count = sum(reference_ngrams.values())

    precision = matches / candidate_count if candidate_count > 0 else 0
    recall = matches / reference_count if reference_count > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1
    }


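def rouge_l(candidate, reference, beta=1.2):
    """
    Compute ROUGE-L (LCS-based).

    Parameters:
    -----------
    candidate : list
        List of generated tokens
    reference : list
        List of reference tokens
    beta : float
        Parameter of the F-measure (beta > 1 weights recall more)

    Returns:
    --------
    dict : precision, recall, f1
        'f1' holds the F-beta score; it equals F1 only when beta=1
    """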
    lcs = lcs_length(candidate, reference)
    m = len(reference)
    n = len(candidate)

    recall = lcs / m if m > 0 else 0
    precision = lcs / n if n > 0 else 0

    if precision + recall > 0:
        f1 = ((1 + beta**2) * precision * recall) / (recall + beta**2 * precision)
    else:
        f1 = 0

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1
    }


# Usage example
candidate = "the cat sat on the mat".split()
reference = "the cat is sitting on the mat right now".split()

print("ROUGE-1:", rouge_n(candidate, reference, n=1))
print("ROUGE-2:", rouge_n(candidate, reference, n=2))
print("ROUGE-L:", rouge_l(candidate, reference))

# Using the rouge-score library (recommended)
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
candidate_text = "the cat sat on the mat"
reference_text = "the cat is sitting on the mat right now"

scores = scorer.score(reference_text, candidate_text)
for key, value in scores.items():
    print(f"{key}: P={value.precision:.4f}, R={value.recall:.4f}, F1={value.fmeasure:.4f}")

Other evaluation metrics

METEOR

A metric that takes synonyms and inflected forms into account:

from nltk.translate.meteor_score import meteor_score
import nltk
# nltk.download('wordnet')

candidate = "the cat sat on the mat"
reference = "the cat is sitting on the mat"

score = meteor_score([reference.split()], candidate.split())
print(f"METEOR: {score:.4f}")

BERTScore

Semantic similarity computed from BERT embeddings:

from bert_score import score as bert_score

candidates = ["the cat sat on the mat"]
references = ["the cat is sitting on the mat"]

P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore: P={P.mean():.4f}, R={R.mean():.4f}, F1={F1.mean():.4f}")

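Choosing a metric

Metric	Use case	Characteristics
Perplexity	Language model evaluation	Intrinsic evaluation; no reference needed
BLEU	Translation, generation	Precision-oriented
ROUGE	Summarization	Recall-oriented
METEOR	Translation	Accounts for synonyms
BERTScore	Any text	Semantic similarity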

Experiment: comparing multiple metrics

import numpy as np
import matplotlib.pyplot as plt
from rouge_score import rouge_scorer
from sacrebleu.metrics import BLEU

def evaluate_all_metrics(candidate, references):
    """Evaluate a candidate string with multiple metrics."""
    # BLEU (sentence_score expects a flat list of reference strings)
    bleu_scorer = BLEU()
    bleu_result = bleu_scorer.sentence_score(candidate, references)

    # ROUGE (computed against the first reference only)
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = rouge.score(references[0], candidate)

    results = {
        'BLEU': bleu_result.score,
        'ROUGE-1 F1': rouge_scores['rouge1'].fmeasure * 100,
        'ROUGE-2 F1': rouge_scores['rouge2'].fmeasure * 100,
        'ROUGE-L F1': rouge_scores['rougeL'].fmeasure * 100,
    }

    return results


# Test cases
test_cases = [
    {
        'name': 'Perfect Match',
        'candidate': 'the cat sat on the mat',
        'references': ['the cat sat on the mat']
    },
    {
        'name': 'Synonym',
        'candidate': 'the feline rested on the rug',
        'references': ['the cat sat on the mat']
    },
    {
        'name': 'Word Order',
        'candidate': 'on the mat sat the cat',
        'references': ['the cat sat on the mat']
    },
    {
        'name': 'Extra Words',
        'candidate': 'the big fluffy cat sat quietly on the soft mat',
        'references': ['the cat sat on the mat']
    },
    {
        'name': 'Missing Words',
        'candidate': 'cat sat mat',
        'references': ['the cat sat on the mat']
    },
]

# Evaluation
results_df = []
for case in test_cases:
    scores = evaluate_all_metrics(case['candidate'], case['references'])
    scores['Case'] = case['name']
    results_df.append(scores)

# Visualization
import pandas as pd
df = pd.DataFrame(results_df)
df = df.set_index('Case')

fig, ax = plt.subplots(figsize=(12, 6))
df.plot(kind='bar', ax=ax)
ax.set_ylabel('Score')
ax.set_title('Comparison of Evaluation Metrics')
ax.legend(loc='upper right')
ax.set_ylim(0, 100)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('metrics_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print(df.to_string())

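Summary

This article covered evaluation metrics for language models and text generation.

Perplexity: measures a language model's predictive ability; lower is better
BLEU: based on n-gram precision; used to evaluate translation and generation
ROUGE: based on n-gram recall; used to evaluate summarization
BERTScore: measures semantic similarity

Rather than relying on a single metric, it is important to combine several. Automatic metrics also do not fully agree with human judgment, so important evaluations should include human evaluation as well.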
