Dropout Theory and Uncertainty Estimation with MC Dropout

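Dropout is a widely used regularization technique for preventing overfitting in neural networks. The idea is simple, randomly disabling units during training, yet it often improves generalization substantially.

This article walks through Dropout in detail, from its theoretical background to its implementation.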

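In this article

The motivation and algorithm behind Dropout
Interpretation as ensemble learning
Comparison with L1/L2 regularization
Implementation and experiments in PyTorch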

Prerequisites

Reading the following article first will make this one easier to follow.

Theory and Implementation of Batch Normalization

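What Is Overfitting?

The Problem of Overfitting

Deep neural networks have a very large number of parameters, which makes them prone to fitting the training data too closely (overfitting).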

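Signs of overfitting:

- The training loss keeps decreasing while the validation loss rises
- Training accuracy is high but test accuracy is low
- The model learns even the noise in the training data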
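As a rough illustration of the first sign, the sketch below inspects recorded loss curves. The function name `overfitting_report` and the loss values are hypothetical, made up for this example, and the rule it applies is only a simple heuristic, not a standard diagnostic.

```python
import numpy as np

def overfitting_report(train_losses, val_losses):
    """Summarize loss curves with a simple heuristic (illustrative only)."""
    train = np.asarray(train_losses, dtype=float)
    val = np.asarray(val_losses, dtype=float)
    best_epoch = int(np.argmin(val))  # epoch with the lowest validation loss
    # Validation loss has risen since its minimum while training loss kept falling
    val_rising = bool(val[-1] > val[best_epoch])
    train_falling = bool(train[-1] < train[best_epoch])
    return {
        "best_epoch": best_epoch,
        "final_gap": float(val[-1] - train[-1]),
        "overfitting_suspected": val_rising and train_falling,
    }

# Hypothetical loss curves, invented for illustration
train_curve = [1.0, 0.7, 0.5, 0.3, 0.2, 0.1]
val_curve = [1.1, 0.8, 0.7, 0.75, 0.85, 0.95]
report = overfitting_report(train_curve, val_curve)
print(report)
```

In this made-up example the validation loss bottoms out at epoch index 2 and then climbs while the training loss keeps falling, which is the pattern described in the list above.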

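The Purpose of Regularization

Regularization prevents overfitting by constraining the complexity of the model:

Limit the magnitude of the parameters (L1/L2 regularization)
Reduce the model's effective degrees of freedom (Dropout)
Effectively enlarge the dataset (Data Augmentation)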

The Dropout Algorithm

The Basic Idea

During training, each unit is randomly "dropped out" (disabled) with probability $p$, independently of the other units.

Mathematical Definition

For an input $\bm{x} = (x_1, x_2, \ldots, x_d)$:

During training:

$$ \tilde{x}_i = \begin{cases} 0 & \text{with probability } p \\ \frac{x_i}{1-p} & \text{with probability } 1-p \end{cases} $$

Equivalently, using a mask $\bm{m}$ whose elements are drawn independently as $m_i \sim \text{Bernoulli}(1-p)$:

$$ \tilde{\bm{x}} = \frac{\bm{m} \odot \bm{x}}{1-p} $$

where $\odot$ denotes the element-wise product.

During inference:

$$ \tilde{\bm{x}} = \bm{x} $$

(No Dropout is applied, and no scaling is needed.)
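To make the definition concrete, here is a small worked example with a hand-picked mask instead of a random one. The values of `x`, `m`, and `p` are arbitrary choices for illustration.

```python
import numpy as np

p = 0.5                                # drop probability
x = np.array([1.0, 2.0, 3.0, 4.0])     # input activations
m = np.array([1.0, 0.0, 1.0, 1.0])     # hand-picked mask (1 = keep, 0 = drop)

x_train = m * x / (1.0 - p)            # training: mask, then rescale by 1/(1-p)
x_infer = x                            # inference: identity, no mask and no scaling

print(x_train)  # [2. 0. 6. 8.]
print(x_infer)  # [1. 2. 3. 4.]
```

The dropped unit becomes 0, and the kept units are doubled because $1/(1-p) = 2$ when $p = 0.5$.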

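Why Scale by $1/(1-p)$?

Disabling some of the units during training changes the expected value of the output.

Without scaling: $$ \mathbb{E}[\tilde{x}_i] = p \cdot 0 + (1-p) \cdot x_i = (1-p) \cdot x_i $$

With scaling: $$ \mathbb{E}[\tilde{x}_i] = p \cdot 0 + (1-p) \cdot \frac{x_i}{1-p} = x_i $$

As a result, the expected values at training time and at inference time match, and no special handling is needed at inference.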
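The two expectations can also be checked numerically. The following is a minimal Monte Carlo sketch; the seed, the value of `p`, and the number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = np.array([1.0, 2.0, 3.0])
n_trials = 200_000

# Each row is one random mask; an element is kept with probability 1 - p
masks = (rng.random((n_trials, x.size)) >= p).astype(np.float64)

unscaled_mean = (masks * x).mean(axis=0)            # should approach (1 - p) * x
scaled_mean = (masks * x / (1.0 - p)).mean(axis=0)  # should approach x

print("without scaling:", unscaled_mean.round(3))
print("with scaling:   ", scaled_mean.round(3))
```

The sample means only approximate the expectations, so small deviations from $(1-p) \cdot x_i$ and $x_i$ are expected.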

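Inverted Dropout

The method above is called "Inverted Dropout". An alternative is:

Training: apply Dropout without scaling
Inference: multiply the output by $(1-p)$

With the same mask, the two differ only by a constant factor of $(1-p)$ in the activations, so they are equivalent up to that rescaling. Inverted Dropout is more efficient because it needs no extra work at inference.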
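The relationship between the two variants can be sketched as follows. The variable names and the random input are illustrative assumptions; the point is only that, for the same mask, the activations of the two schemes differ by the constant factor $(1-p)$ in both phases.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.4
x = rng.normal(size=5)
mask = (rng.random(5) >= p).astype(np.float64)  # 1 = keep, with probability 1 - p

# Inverted Dropout: scale during training, identity at inference
inv_train = mask * x / (1.0 - p)
inv_infer = x

# The alternative: no scaling during training, scale at inference
std_train = mask * x
std_infer = (1.0 - p) * x

# In both phases the alternative equals (1 - p) times the inverted variant
print(np.allclose(std_train, (1.0 - p) * inv_train))
print(np.allclose(std_infer, (1.0 - p) * inv_infer))
```

Because the factor is the same in training and inference, the next layer's weights can absorb it, which is why the choice between the two is mostly a matter of where the multiplication is paid for.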

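Theoretical Interpretations of Dropout

Interpretation as Ensemble Learning

Dropout can be viewed as an ensemble of exponentially many subnetworks.

With $n$ units there are $2^n$ possible subnetworks. Dropout trains a different, randomly sampled subnetwork at each training step, and using the full network at inference approximates taking the "average" of their predictions.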
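For a very small layer, the $2^n$ subnetworks can be enumerated explicitly. The sketch below does this for a single linear output unit with hypothetical weights `w`; in this linear case the probability-weighted average over all masks equals the full network's output exactly, whereas with nonlinear layers the full network is only an approximation of the average.

```python
import itertools
import numpy as np

p = 0.5
x = np.array([1.0, -2.0, 3.0])
w = np.array([0.5, 0.25, -1.0])  # hypothetical weights of one linear output unit
n = x.size

ensemble_mean = 0.0
n_subnets = 0
for bits in itertools.product([0, 1], repeat=n):
    m = np.array(bits, dtype=np.float64)
    k = int(m.sum())                         # number of kept units
    prob = (1 - p) ** k * p ** (n - k)       # probability of drawing this mask
    ensemble_mean += prob * float(w @ (m * x / (1 - p)))
    n_subnets += 1

full_output = float(w @ x)  # inference-time output with no mask

print(n_subnets)       # 8
print(ensemble_mean)   # matches full_output up to floating-point error
print(full_output)     # -3.0
```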

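Weight-Sharing Ensemble

Unlike an ordinary ensemble, the subnetworks share their weights. As a result:

Memory usage is low, because only one set of weights is stored
The subnetworks are trained to complement one another
Inference costs the same as a single model while giving some of the benefit of many models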

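Interpretation as Bayesian Inference

Dropout can also be interpreted as an approximation to Bayesian inference over the weights.

Ordinary training produces a point estimate (a single value for each weight). Dropout, by contrast, can be seen as implicitly learning a distribution over the weights and making predictions by sampling from that distribution.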

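Preventing Co-adaptation

Dropout also helps prevent "co-adaptation" between units.

Co-adaptation means that units learn features that work only in combination with specific other units. Because Dropout removes units at random, each unit is pushed to learn features that are useful on their own, without relying on particular other units.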

L1/L2 Regularization

L2 Regularization (Weight Decay)

Add the squared L2 norm of the parameters, scaled by $\lambda/2$, to the loss function:

$$ \mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2} \|\bm{w}\|_2^2 = \mathcal{L} + \frac{\lambda}{2} \sum_i w_i^2 $$

Gradient: $$ \frac{\partial \mathcal{L}_{\text{reg}}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial w_i} + \lambda w_i $$

Effects:

- Shrinks the weights toward 0
- Favors smoother functions
- Copes with correlated features
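The name "weight decay" follows from the gradient of the L2-regularized loss: under plain gradient descent, the extra $\lambda w_i$ term multiplies each weight by $(1 - \eta\lambda)$ at every step, where $\eta$ is the learning rate. The sketch below checks this for one step; the learning rate, $\lambda$, and the gradient values are hypothetical.

```python
import numpy as np

lr, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 0.5])
grad_loss = np.array([0.3, -0.1, 0.2])  # hypothetical dL/dw for the unregularized loss

# One gradient descent step on the regularized loss
w_l2 = w - lr * (grad_loss + lam * w)

# The same step written as "decay the weights, then take a plain step"
w_decay = (1.0 - lr * lam) * w - lr * grad_loss

print(w_l2)
print(w_decay)
```

This equivalence holds for plain gradient descent. With an adaptive optimizer such as Adam, adding $\lambda w_i$ to the gradient (what `weight_decay` does in `torch.optim.Adam`) is not the same as decaying the weights directly, which is what `torch.optim.AdamW` does.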

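L1 Regularization (Lasso)

Add the L1 norm of the parameters to the loss function:

$$ \mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\bm{w}\|_1 = \mathcal{L} + \lambda \sum_i |w_i| $$

Effects:

- Makes the weights sparse (many become exactly 0)
- Acts as feature selection
- Not differentiable at the origin (a subgradient is used)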
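The sketch below illustrates the last two points. `l1_subgradient` returns one valid subgradient of the L1 norm. `soft_threshold` is a different technique from the subgradient method mentioned above: it is the proximal operator of the L1 penalty, included here only because it makes the sparsifying effect easy to see. Both function names and all numeric values are assumptions for illustration.

```python
import numpy as np

def l1_subgradient(w):
    """One valid subgradient of ||w||_1: sign(w), taking 0 where w_i = 0."""
    return np.sign(w)

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink each weight toward 0 by t, clipping at 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.0, -1.2, 0.02])
lam, lr = 0.1, 1.0

sub = l1_subgradient(w)              # constant magnitude, regardless of |w_i|
w_prox = soft_threshold(w, lr * lam) # weights with |w_i| <= 0.1 become exactly 0

print(sub)
print(w_prox)
```

Unlike the L2 gradient $\lambda w_i$, which becomes small as $w_i$ approaches 0, the L1 term pushes with constant strength, so small weights are driven all the way to 0.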

L1 vs L2 vs Dropout

Method	Effect	Computational cost
L1	Sparsification	Low
L2	Weight shrinkage	Low
Dropout	Ensemble effect	Increases during training only

Implementation in Python

Dropout from Scratch

import numpy as np
import matplotlib.pyplot as plt

class Dropout:
    """Dropout implemented from scratch"""

    def __init__(self, p=0.5):
        """
        Args:
            p: dropout probability (probability of disabling a unit); must satisfy 0 <= p < 1
        """
        self.p = p
        self.mask = None
        self.training = True

    def forward(self, x):
        if self.training and self.p > 0:
            # Generate a Bernoulli mask (1 = keep, with probability 1 - p)
            self.mask = (np.random.rand(*x.shape) > self.p).astype(np.float32)
            # Inverted Dropout: scale at training time
            return x * self.mask / (1 - self.p)
        else:
            return x

    def backward(self, dout):
        if self.training and self.p > 0:
            return dout * self.mask / (1 - self.p)
        else:
            return dout

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

# Sanity check
np.random.seed(42)
dropout = Dropout(p=0.5)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print("Input:", x)

dropout.train()
for i in range(3):
    out = dropout.forward(x)
    print(f"Training output {i+1}:", out.round(2))

dropout.eval()
out = dropout.forward(x)
print("Inference output:", out.round(2))

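# Check the expected value
n_samples = 10000
# Still in eval mode here, so forward() is the identity
outputs_eval = np.array([dropout.forward(x) for _ in range(n_samples)])
dropout.train()
outputs_train = np.array([dropout.forward(x) for _ in range(n_samples)])

print(f"\nExpected value (inference): {outputs_eval.mean(axis=0).round(4)}")
print(f"Expected value (training): {outputs_train.mean(axis=0).round(4)}")
print(f"Original input: {x}")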

Implementation in PyTorch and Comparison

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import matplotlib.pyplot as plt

class MLPWithDropout(nn.Module):
    """MLP with Dropout"""

    def __init__(self, input_dim=20, hidden_dim=256, num_classes=10, dropout_rate=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x):
        return self.net(x)

class MLPWithL2(nn.Module):
    """MLP for L2 regularization (applied through the optimizer's weight_decay)"""

    def __init__(self, input_dim=20, hidden_dim=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x):
        return self.net(x)

class MLPNoRegularization(nn.Module):
    """MLP without regularization"""

    def __init__(self, input_dim=20, hidden_dim=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x):
        return self.net(x)

def create_dataset(n_samples=1000, input_dim=20, num_classes=5, noise=0.3):
    """Noisy dataset (easy to overfit)"""
    np.random.seed(42)

    # Keep the sample count small so that overfitting occurs easily
    X = np.random.randn(n_samples, input_dim).astype(np.float32)
    W_true = np.random.randn(input_dim, num_classes).astype(np.float32)
    logits = X @ W_true + np.random.randn(n_samples, num_classes).astype(np.float32) * noise
    y = np.argmax(logits, axis=1)

    # Train/Test split
    split = int(0.7 * n_samples)
    return (X[:split], y[:split]), (X[split:], y[split:])

def train_and_evaluate(model, train_loader, test_loader, n_epochs=100,
                       lr=0.01, weight_decay=0.0):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    train_losses = []
    train_accuracies = []
    test_accuracies = []

    for epoch in range(n_epochs):
        # Training
        model.train()
        epoch_loss = 0
        correct_train = 0
        total_train = 0

        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total_train += y_batch.size(0)
            correct_train += (predicted == y_batch).sum().item()

        train_losses.append(epoch_loss / len(train_loader))
        train_accuracies.append(correct_train / total_train)

        # Evaluation
        model.eval()
        correct_test = 0
        total_test = 0
        with torch.no_grad():
            for X_batch, y_batch in test_loader:
                outputs = model(X_batch)
                _, predicted = torch.max(outputs, 1)
                total_test += y_batch.size(0)
                correct_test += (predicted == y_batch).sum().item()
        test_accuracies.append(correct_test / total_test)

    return train_losses, train_accuracies, test_accuracies

# Prepare the data
NUM_CLASSES = 5  # the models' output size should match the number of classes in the data
(X_train, y_train), (X_test, y_test) = create_dataset(n_samples=800, num_classes=NUM_CLASSES)

train_dataset = TensorDataset(torch.tensor(X_train), torch.tensor(y_train, dtype=torch.long))
test_dataset = TensorDataset(torch.tensor(X_test), torch.tensor(y_test, dtype=torch.long))

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Train each model
results = {}

# No regularization
torch.manual_seed(42)
model_no_reg = MLPNoRegularization(num_classes=NUM_CLASSES)
results['No Regularization'] = train_and_evaluate(model_no_reg, train_loader, test_loader)

# L2 regularization (Weight Decay)
torch.manual_seed(42)
model_l2 = MLPWithL2(num_classes=NUM_CLASSES)
results['L2 (Weight Decay)'] = train_and_evaluate(model_l2, train_loader, test_loader, weight_decay=0.01)

# Dropout
torch.manual_seed(42)
model_dropout = MLPWithDropout(num_classes=NUM_CLASSES, dropout_rate=0.5)
results['Dropout (p=0.5)'] = train_and_evaluate(model_dropout, train_loader, test_loader)

# Dropout + L2
torch.manual_seed(42)
model_both = MLPWithDropout(num_classes=NUM_CLASSES, dropout_rate=0.3)
results['Dropout + L2'] = train_and_evaluate(model_both, train_loader, test_loader, weight_decay=0.001)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

colors = {
    'No Regularization': 'red',
    'L2 (Weight Decay)': 'blue',
    'Dropout (p=0.5)': 'green',
    'Dropout + L2': 'purple'
}

# Training loss
ax1 = axes[0]
for name, (train_loss, train_acc, test_acc) in results.items():
    ax1.plot(train_loss, label=name, color=colors[name], linewidth=1.5)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Training Loss')
ax1.set_title('Training Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Training accuracy
ax2 = axes[1]
for name, (train_loss, train_acc, test_acc) in results.items():
    ax2.plot(train_acc, label=name, color=colors[name], linewidth=1.5)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Training Accuracy')
ax2.set_title('Training Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Test accuracy
ax3 = axes[2]
for name, (train_loss, train_acc, test_acc) in results.items():
    ax3.plot(test_acc, label=name, color=colors[name], linewidth=1.5)
ax3.set_xlabel('Epoch')
ax3.set_ylabel('Test Accuracy')
ax3.set_title('Test Accuracy')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Final results
print("\nFinal Results:")
print(f"{'Method':<25} {'Train Acc':>12} {'Test Acc':>12} {'Gap':>10}")
print("-" * 60)
for name, (train_loss, train_acc, test_acc) in results.items():
    gap = train_acc[-1] - test_acc[-1]
    print(f"{name:<25} {train_acc[-1]:>12.4f} {test_acc[-1]:>12.4f} {gap:>10.4f}")
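The experiment above relies on `model.train()` and `model.eval()` switching `nn.Dropout` on and off. The following is a small check of that behavior on an isolated layer; the seed, the tensor size, and `p` are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
drop = nn.Dropout(p=p)
x = torch.ones(1000)

drop.train()           # training mode: random mask plus 1/(1-p) scaling
y_train = drop(x)

drop.eval()            # evaluation mode: the layer is the identity
y_eval = drop(x)

kept = y_train != 0
print("fraction kept:", kept.float().mean().item())
print("value of kept units:", y_train[kept][0].item())  # 1 / (1 - p) = 2.0
print("eval output equals input:", torch.equal(y_eval, x))
```

Forgetting to call `model.eval()` before evaluation would leave the random mask active, so reported test accuracy would fluctuate from run to run.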

MC Dropout (Uncertainty Estimation)

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

class MCDropoutModel(nn.Module):
    """Model for MC Dropout"""

    def __init__(self, input_dim=1, hidden_dim=100, dropout_rate=0.1):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

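def mc_dropout_predict(model, x, n_samples=100):
    """Prediction and uncertainty estimation with MC Dropout"""
    was_training = model.training
    model.train()  # switch to train mode so that Dropout stays active

    predictions = []
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x)  # each pass draws a fresh dropout mask
            predictions.append(pred.numpy())

    model.train(was_training)  # restore the mode the model was in

    predictions = np.array(predictions)
    mean = predictions.mean(axis=0)  # predictive mean
    std = predictions.std(axis=0)    # spread across the stochastic passes

    return mean, std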

# Generate the data
np.random.seed(42)
torch.manual_seed(42)

# Training data
X_train = np.linspace(-3, 3, 50).reshape(-1, 1).astype(np.float32)
y_train = np.sin(X_train) + np.random.randn(*X_train.shape).astype(np.float32) * 0.1

# Test data (including extrapolation regions)
X_test = np.linspace(-5, 5, 200).reshape(-1, 1).astype(np.float32)
y_true = np.sin(X_test)

# Train the model
model = MCDropoutModel(dropout_rate=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    outputs = model(torch.tensor(X_train))
    loss = criterion(outputs, torch.tensor(y_train))
    loss.backward()
    optimizer.step()

# MC Dropout prediction
mean, std = mc_dropout_predict(model, torch.tensor(X_test), n_samples=100)

# Visualization
plt.figure(figsize=(10, 6))

# Predictive mean and uncertainty
plt.fill_between(X_test.flatten(),
                 (mean - 2*std).flatten(),
                 (mean + 2*std).flatten(),
                 alpha=0.3, label='Mean ± 2 std (approx. 95% if Gaussian)')
plt.plot(X_test, mean, 'b-', linewidth=2, label='MC Dropout Mean')
plt.plot(X_test, y_true, 'g--', linewidth=2, label='True Function')
plt.scatter(X_train, y_train, c='red', s=20, label='Training Data', zorder=5)

# Mark the range of the training data
plt.axvline(x=-3, color='gray', linestyle=':', alpha=0.5)
plt.axvline(x=3, color='gray', linestyle=':', alpha=0.5)
plt.text(-4.5, 1, 'Extrapolation', fontsize=10, color='gray')
plt.text(3.2, 1, 'Extrapolation', fontsize=10, color='gray')

plt.xlabel('x')
plt.ylabel('y')
plt.title('MC Dropout: Uncertainty Estimation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(-5, 5)
plt.ylim(-2, 2)
plt.show()

print("Note: Uncertainty is expected to increase in extrapolation regions (outside the training data range)")
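`mc_dropout_predict` calls `model.train()`, which switches every layer to training mode. That is harmless for the model above, but it would also change the behavior of layers such as BatchNorm if they were present. One possible alternative, sketched here with a hypothetical helper named `enable_dropout_only` and a made-up demo model, is to put the whole model in evaluation mode and re-enable only the Dropout layers.

```python
import torch
import torch.nn as nn

def enable_dropout_only(model: nn.Module) -> None:
    """Put the model in eval mode, then switch only Dropout layers back to train mode."""
    model.eval()
    dropout_types = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)
    for module in model.modules():
        if isinstance(module, dropout_types):
            module.train()

# Hypothetical model containing both BatchNorm and Dropout
torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.BatchNorm1d(8),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(8, 1),
)
enable_dropout_only(model)

x = torch.randn(16, 4)
with torch.no_grad():
    out1 = model(x)  # two passes use different dropout masks
    out2 = model(x)

print("BatchNorm in training mode:", model[1].training)  # False
print("Dropout in training mode:", model[3].training)    # True
print("passes differ:", not torch.equal(out1, out2))
```

With this setup BatchNorm keeps using its running statistics while the predictions remain stochastic, which is what MC Dropout needs.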

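How to Use Dropout

Choosing the Dropout Rate

Layer type	Recommended dropout rate
Fully connected layer	0.5 (value from the original paper)
Convolutional layer	0.1 to 0.25
Before the final layer	0.2 to 0.5
RNN/LSTM	0.2 to 0.5 (inputs only)
Transformer	0.1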

Placement

# Typical placement
nn.Linear(in_features, out_features)
nn.ReLU()
nn.Dropout(p=0.5)  # after the activation function

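Caveats

Disable at inference: do not forget to call model.eval()
Combining with batch normalization: using both together has become less common recently
Combining with residual connections: DropPath (Stochastic Depth) is effective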
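DropPath is mentioned above only by name. As a minimal sketch of the idea, assuming the common formulation in which the whole residual branch is dropped per sample and the surviving branches are rescaled by 1/(1 - drop_prob), it could look like the following. The class names `DropPath` and `ResidualBlock` and the layer sizes are illustrative choices, not taken from a specific library.

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth sketch: zero the entire branch output for randomly chosen samples."""

    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x  # identity at inference
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over all remaining dimensions
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = torch.bernoulli(torch.full(shape, keep_prob, dtype=x.dtype, device=x.device))
        return x * mask / keep_prob

class ResidualBlock(nn.Module):
    """Residual block whose branch is randomly skipped during training."""

    def __init__(self, dim: int, drop_prob: float = 0.1):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.drop_path = DropPath(drop_prob)

    def forward(self, x):
        return x + self.drop_path(self.branch(x))

torch.manual_seed(0)
block = ResidualBlock(dim=8, drop_prob=0.5)
x = torch.randn(4, 8)

block.eval()
out_eval = block(x)  # equals x + branch(x): nothing is dropped

dp = DropPath(0.5).train()
y = dp(torch.ones(64, 3))  # each row is either all 0 or all 2
print(y[:5])
```

Unlike ordinary Dropout, which masks individual units, the mask here has one entry per sample, so a dropped sample passes through the block using the skip connection alone.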

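Summary

This article covered Dropout and regularization.

Dropout is a regularization technique that randomly disables units during training
It can be interpreted theoretically as ensemble learning or as Bayesian inference
L2 regularization (Weight Decay) shrinks the weights, while L1 regularization makes them sparse
MC Dropout can be used to estimate predictive uncertainty
It is common to combine Dropout, L2 regularization, and other techniques

This concludes the series of articles on the basic building blocks of deep learning.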
