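Learning Rate Schedules: Types and How to Choose

The learning rate is one of the most important hyperparameters in training neural networks. A learning rate schedule changes the learning rate dynamically as training progresses, which can improve both convergence speed and final performance.

This article covers the main learning rate schedules, from theory to implementation.

In this article

Mathematical definition of each learning rate schedule
Why warmup matters
Implementation in PyTorch and a comparison experiment

Prerequisites

Reading the following article first will help you get more out of this one.

Optimizer comparison (Adam, AdamW, Lion)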

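The role of the learning rate

When the learning rate is too large

Updates are too large and training diverges
The parameters overshoot the optimum and oscillate
Training becomes unstable

When the learning rate is too small

Convergence is very slow
Training is more likely to get stuck in a local optimum
Compute is wasted

The ideal learning rate schedule

Early training: a moderately large learning rate to bring the loss down quickly
Mid training: gradually lower the learning rate to improve accuracy
Late training: a small learning rate for fine adjustments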

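Step Decay

Mathematical definition

The learning rate is multiplied by a constant factor every fixed number of epochs:

$$ \eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor} $$

where:
- $\eta_0$: initial learning rate
- $\gamma$: decay factor (typically 0.1)
- $s$: step size (in epochs)
- $t$: current epoch
- $\lfloor \cdot \rfloor$: floor function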
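As a quick numerical check of the formula, the sketch below evaluates it at a few epochs. The function name `step_decay_lr` and the default values (initial rate 0.1, decay factor 0.1, step size 30) are illustrative assumptions, not prescribed settings.

```python
def step_decay_lr(t, eta0=0.1, gamma=0.1, s=30):
    """Step decay: eta_t = eta0 * gamma ** floor(t / s)."""
    return eta0 * gamma ** (t // s)

# The rate stays flat inside each block of s epochs and drops at multiples of s.
for t in (0, 29, 30, 59, 60, 90):
    print(f"epoch {t:2d}: lr = {step_decay_lr(t):.0e}")
```

With these values the rate is constant for epochs 0 to 29 and then drops by a factor of 10 at epochs 30, 60, and 90.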

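Characteristics

Advantages	Disadvantages
Simple and easy to implement	Abrupt changes can destabilize training
Easy to understand	Choosing the step size is difficult
Widely used	Not smooth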

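Exponential Decay

Mathematical definition

The learning rate decays exponentially at every epoch, with $0 < \gamma < 1$:

$$ \eta_t = \eta_0 \cdot \gamma^t $$

or, in continuous form (the two coincide when $\gamma = e^{-\lambda}$):

$$ \eta_t = \eta_0 \cdot e^{-\lambda t} $$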
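The following sketch relates the two forms numerically and computes how many epochs it takes for the rate to halve. The function names and the value $\gamma = 0.95$ are assumptions chosen for illustration.

```python
import math

def exp_decay_discrete(t, eta0=0.1, gamma=0.95):
    """Discrete form: eta_t = eta0 * gamma ** t."""
    return eta0 * gamma ** t

def exp_decay_continuous(t, eta0=0.1, lam=-math.log(0.95)):
    """Continuous form: eta_t = eta0 * exp(-lam * t)."""
    return eta0 * math.exp(-lam * t)

gamma = 0.95
lam = -math.log(gamma)            # the lambda that matches this gamma
half_life = math.log(2) / lam     # epochs until the learning rate halves

print(f"lambda = {lam:.4f}, half-life = {half_life:.1f} epochs")
print(f"lr at epoch 100 = {exp_decay_discrete(100):.2e}")
```

With $\gamma = 0.95$ the rate halves roughly every 13.5 epochs, so after 100 epochs it is well below 1% of its initial value.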

Characteristics

It decays more smoothly than Step Decay, but the learning rate can become too small too quickly in the later part of training.

Cosine Annealing

Mathematical definition

The learning rate decays along a cosine curve:

$$ \eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T} \pi\right)\right) $$

where:
- $\eta_{\max}$: maximum learning rate
- $\eta_{\min}$: minimum learning rate
- $T$: total number of epochs
- $t$: current epoch

Properties of the cosine function

At $t = 0$, $\cos(0) = 1$, so $\eta_0 = \eta_{\max}$.
At $t = T$, $\cos(\pi) = -1$, so $\eta_T = \eta_{\min}$.
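The endpoint values can be confirmed numerically, along with the fact that the curve is flat near both ends and steepest in the middle. This is a minimal sketch; `cosine_lr` and its default values are illustrative assumptions.

```python
import math

def cosine_lr(t, T=100, eta_max=0.1, eta_min=0.001):
    """Cosine annealing from eta_max (t = 0) down to eta_min (t = T)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 10, 50, 90, 100):
    print(f"t = {t:3d}: lr = {cosine_lr(t):.5f}")

# Change over 10 epochs at the start versus around the midpoint
early_drop = cosine_lr(0) - cosine_lr(10)
middle_drop = cosine_lr(45) - cosine_lr(55)
print(f"early drop = {early_drop:.5f}, middle drop = {middle_drop:.5f}")
```

At $t = T/2$ the rate is the average of $\eta_{\max}$ and $\eta_{\min}$, and the drop over the first 10 epochs is several times smaller than the drop over the 10 epochs around the midpoint.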

Cosine Annealing with Warm Restarts

An extension that periodically resets the learning rate:

$$ \eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{T_{\text{cur}}}{T_i} \pi\right)\right) $$

Here $T_i$ is the length of the $i$-th restart cycle and $T_{\text{cur}}$ is the number of epochs elapsed within the current cycle.
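To see where the restarts fall, the sketch below lists the epoch at which each cycle begins. It assumes that each cycle is `T_mult` times longer than the previous one, starting from length `T_0`; these names follow the arguments of PyTorch's `CosineAnnealingWarmRestarts`, while `restart_epochs` itself is a hypothetical helper.

```python
def restart_epochs(T_0=20, T_mult=2, total_epochs=200):
    """Epochs at which a new cycle starts, assuming T_{i+1} = T_i * T_mult."""
    starts = []
    start, T_i = 0, T_0
    while start < total_epochs:
        starts.append(start)
        start += T_i      # the next cycle begins when this one ends
        T_i *= T_mult     # and is T_mult times longer
    return starts

print(restart_epochs())  # [0, 20, 60, 140]
```

With `T_0=20` and `T_mult=2` the cycles last 20, 40, and 80 epochs, so a 100-epoch run ends partway through the third cycle.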

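Characteristics

Advantages	Disadvantages
Smooth decay	$T$ (total number of epochs) must be fixed in advance
Changes gently at the start and end of training	More hyperparameters to tune
Widely used with Transformers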

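Linear Warmup

Mathematical definition

At the very start of training, the learning rate is increased gradually from 0 over $T_{\text{warmup}}$ epochs:

$$ \eta_t = \begin{cases} \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} & \text{if } t < T_{\text{warmup}} \\ \eta_{\max} & \text{if } t \geq T_{\text{warmup}} \end{cases} $$

Why warmup is needed

Large batches: with a large batch size the gradient variance is small, so large learning rates are typically used, and those large steps can be unstable early in training
Transformer stabilization: Self-Attention is sensitive to initialization, and warmup helps stabilize training
Adaptive optimizers: in Adam and similar optimizers, the moment estimates are unreliable during the first few steps

Warmup + Cosine Annealing

In practice, warmup is often combined with Cosine Annealing:

$$ \eta_t = \begin{cases} \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} & \text{if } t < T_{\text{warmup}} \\ \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t - T_{\text{warmup}}}{T - T_{\text{warmup}}} \pi\right)\right) & \text{otherwise} \end{cases} $$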
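The combined formula can also be written as a multiplier on the peak learning rate, which is convenient when the schedule is applied per iteration rather than per epoch. The sketch below is one way to do that; the name `warmup_cosine_factor`, the step counts, and `min_ratio` (standing for $\eta_{\min} / \eta_{\max}$) are assumptions for illustration.

```python
import math

def warmup_cosine_factor(step, warmup_steps=500, total_steps=10_000, min_ratio=0.01):
    """Multiplier applied to the peak learning rate at a given step.

    min_ratio is eta_min / eta_max (a hypothetical parameter name).
    """
    if step < warmup_steps:
        # Linear warmup from 0 to 1
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold the final value if stepped past the end
    # Cosine decay from 1 down to min_ratio
    return min_ratio + 0.5 * (1.0 - min_ratio) * (1.0 + math.cos(math.pi * progress))

eta_max = 1e-3
for step in (0, 250, 500, 5250, 10_000):
    print(f"step {step:6d}: lr = {eta_max * warmup_cosine_factor(step):.2e}")
```

A function of this shape, taking a step index and returning a multiplier, is the kind of callable that PyTorch's `LambdaLR` accepts as `lr_lambda`, provided the optimizer's base learning rate is set to the peak value.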

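OneCycleLR

Mathematical definition

OneCycleLR (the One Cycle Policy) treats the whole training run as a single cycle in which the learning rate first increases and then decreases.

Phase 1 (increase): over the first 30-50% of iterations, the learning rate rises from $\eta_{\min}$ to $\eta_{\max}$

Phase 2 (decrease): over the remaining iterations, it falls from $\eta_{\max}$ back to $\eta_{\min}$

Some variants also include a final "annihilation" phase, in which the learning rate is made very small over the last few percent of training.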
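The three phases can be sketched as a piecewise-linear function. This is an illustrative version, not PyTorch's implementation: the built-in `OneCycleLR` is parametrized differently (for example through `div_factor` and `final_div_factor`) and anneals with a cosine curve by default. The phase fractions and `lr_final` below are assumed values.

```python
def one_cycle_lr(step, total_steps=1000, lr_max=0.1, lr_min=0.01,
                 lr_final=1e-4, pct_up=0.45, pct_down=0.50):
    """Piecewise-linear one-cycle schedule with a final annihilation phase."""
    up_end = pct_up * total_steps
    down_end = (pct_up + pct_down) * total_steps
    if step < up_end:
        # Phase 1: lr_min -> lr_max
        return lr_min + (lr_max - lr_min) * step / up_end
    if step < down_end:
        # Phase 2: lr_max -> lr_min
        return lr_max - (lr_max - lr_min) * (step - up_end) / (down_end - up_end)
    # Annihilation: lr_min -> lr_final over the remaining steps
    frac = min((step - down_end) / (total_steps - down_end), 1.0)
    return lr_min + (lr_final - lr_min) * frac

for step in (0, 450, 950, 1000):
    print(f"step {step:4d}: lr = {one_cycle_lr(step):.4f}")
```

Here the rate climbs for the first 45% of the steps, falls for the next 50%, and spends the last 5% decaying to `lr_final`.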

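Characteristics

Advantages	Disadvantages
Can achieve super-convergence (very fast convergence)	The choice of maximum learning rate is critical
Makes use of every training step	The total number of iterations must be fixed in advance
Performs well on many tasks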

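ReduceLROnPlateau

Algorithm

An adaptive schedule that lowers the learning rate once the validation loss stops improving:

Monitor the validation loss every epoch
If it has not improved for patience consecutive epochs
Multiply the learning rate by factor (with factor < 1)

$$ \eta_{\text{new}} = \eta_{\text{old}} \times \text{factor} $$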
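The three steps above can be sketched in plain Python as follows. `PlateauReducer` is a hypothetical, simplified class that follows the description in this article: it reduces the rate after `patience` epochs without improvement. PyTorch's `ReduceLROnPlateau` counts bad epochs slightly differently and also supports options such as `threshold`, `cooldown`, and `min_lr`, none of which are modeled here.

```python
class PlateauReducer:
    """Simplified plateau-based schedule (no threshold, cooldown, or min_lr)."""

    def __init__(self, lr, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            # Improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            # eta_new = eta_old * factor
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

reducer = PlateauReducer(lr=0.1, factor=0.5, patience=2)
val_losses = [1.0, 0.8, 0.81, 0.82, 0.7, 0.71, 0.72, 0.73]  # made-up values
history = [reducer.step(v) for v in val_losses]
print(history)  # [0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.025, 0.025]
```

The rate is halved each time two consecutive epochs fail to beat the best validation loss seen so far.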

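Characteristics

Advantages	Disadvantages
Adaptive, with little manual tuning	Requires a validation set
Detects stalled validation loss (e.g. from overfitting) and responds	May react late
Broadly applicable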

Implementation in Python

Implementing and visualizing each schedule

import numpy as np
import matplotlib.pyplot as plt

def step_decay(epoch, initial_lr=0.1, drop_rate=0.1, epochs_drop=30):
    """Step Decay"""
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def exponential_decay(epoch, initial_lr=0.1, decay_rate=0.95):
    """Exponential Decay"""
    return initial_lr * (decay_rate ** epoch)

def cosine_annealing(epoch, total_epochs=100, lr_max=0.1, lr_min=0.0):
    """Cosine Annealing"""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / total_epochs))

def warmup_cosine(epoch, total_epochs=100, warmup_epochs=10, lr_max=0.1, lr_min=0.0):
    """Warmup + Cosine Annealing"""
    if epoch < warmup_epochs:
        return lr_max * epoch / warmup_epochs
    else:
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

def one_cycle(epoch, total_epochs=100, lr_max=0.1, lr_min=0.001, pct_start=0.3):
    """One-cycle schedule (simplified: linear ramps, no annihilation phase)"""
    if epoch < total_epochs * pct_start:
        # Increasing phase: lr_min -> lr_max
        progress = epoch / (total_epochs * pct_start)
        return lr_min + (lr_max - lr_min) * progress
    else:
        # Decreasing phase: lr_max -> lr_min
        progress = (epoch - total_epochs * pct_start) / (total_epochs * (1 - pct_start))
        return lr_max - (lr_max - lr_min) * progress

def cosine_warm_restarts(epoch, total_epochs=100, lr_max=0.1, lr_min=0.0, T_0=20, T_mult=2):
    """Cosine Annealing with Warm Restarts"""
    # Find the current cycle: T_i is its length, T_cur the epochs elapsed within it.
    # total_epochs is not needed: the last cycle is simply cut off when training ends.
    T_cur = epoch
    T_i = T_0

    while T_cur >= T_i:
        T_cur -= T_i
        T_i *= T_mult

    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * T_cur / T_i))

# Visualization
epochs = np.arange(100)

plt.figure(figsize=(14, 10))

schedules = {
    'Step Decay': [step_decay(e) for e in epochs],
    'Exponential Decay': [exponential_decay(e) for e in epochs],
    'Cosine Annealing': [cosine_annealing(e) for e in epochs],
    'Warmup + Cosine': [warmup_cosine(e) for e in epochs],
    'OneCycleLR': [one_cycle(e) for e in epochs],
    'Cosine Warm Restarts': [cosine_warm_restarts(e) for e in epochs],
}

for idx, (name, lr_values) in enumerate(schedules.items(), 1):
    plt.subplot(2, 3, idx)
    plt.plot(epochs, lr_values, linewidth=2)
    plt.xlabel('Epoch')
    plt.ylabel('Learning Rate')
    plt.title(name)
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Usage in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import (
    StepLR, ExponentialLR, CosineAnnealingLR,
    CosineAnnealingWarmRestarts, OneCycleLR, ReduceLROnPlateau
)
import matplotlib.pyplot as plt

# Record the learning-rate trajectory of each scheduler
def get_lr_history(scheduler_fn, n_epochs=100):
    """Return the per-epoch learning rates produced by a scheduler."""
    # Fresh dummy model and optimizer for every scheduler
    model = nn.Linear(10, 2)
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = scheduler_fn(optimizer)

    lr_history = []
    for epoch in range(n_epochs):
        lr_history.append(optimizer.param_groups[0]['lr'])
        # Dummy training step (optimizer.step() should precede scheduler.step())
        optimizer.step()
        scheduler.step()

    return lr_history

# Build each scheduler
schedulers = {
    'StepLR': lambda opt: StepLR(opt, step_size=30, gamma=0.1),
    'ExponentialLR': lambda opt: ExponentialLR(opt, gamma=0.95),
    'CosineAnnealingLR': lambda opt: CosineAnnealingLR(opt, T_max=100, eta_min=0.001),
    'CosineAnnealingWarmRestarts': lambda opt: CosineAnnealingWarmRestarts(opt, T_0=20, T_mult=2),
}

# Visualization
plt.figure(figsize=(12, 8))

for idx, (name, scheduler_fn) in enumerate(schedulers.items(), 1):
    lr_history = get_lr_history(scheduler_fn)
    plt.subplot(2, 2, idx)
    plt.plot(lr_history, linewidth=2)
    plt.xlabel('Epoch')
    plt.ylabel('Learning Rate')
    plt.title(f'PyTorch {name}')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Writing your own scheduler with warmup

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

class WarmupCosineScheduler:
    """Warmup + Cosine Annealing scheduler"""

    def __init__(self, optimizer, warmup_epochs, total_epochs, lr_max, lr_min=0.0):
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.lr_max = lr_max
        self.lr_min = lr_min
        self.current_epoch = 0
        # Apply the epoch-0 rate immediately; otherwise the first epoch would
        # run at the optimizer's initial lr instead of the warmup value
        self._set_lr()

    def _set_lr(self):
        if self.current_epoch < self.warmup_epochs:
            # Warmup phase (lr is 0 at epoch 0, as in the formula)
            lr = self.lr_max * self.current_epoch / self.warmup_epochs
        else:
            # Cosine annealing phase
            progress = (self.current_epoch - self.warmup_epochs) / (self.total_epochs - self.warmup_epochs)
            lr = self.lr_min + 0.5 * (self.lr_max - self.lr_min) * (1 + np.cos(np.pi * progress))

        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr

    def step(self):
        # Call once at the end of each epoch
        self.current_epoch += 1
        self._set_lr()

    def get_lr(self):
        return self.optimizer.param_groups[0]['lr']

# Usage example
model = nn.Linear(10, 2)
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = WarmupCosineScheduler(
    optimizer,
    warmup_epochs=10,
    total_epochs=100,
    lr_max=0.001,
    lr_min=1e-6
)

lr_history = []
for epoch in range(100):
    lr_history.append(scheduler.get_lr())
    scheduler.step()

plt.figure(figsize=(10, 5))
plt.plot(lr_history, linewidth=2)
plt.axvline(x=10, color='red', linestyle='--', label='End of Warmup')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Warmup + Cosine Annealing')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Experiment: comparing training curves across schedules

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import matplotlib.pyplot as plt

def create_dataset(n_samples=5000, input_dim=20, num_classes=10):
    np.random.seed(42)
    X = np.random.randn(n_samples, input_dim).astype(np.float32)
    W = np.random.randn(input_dim, num_classes).astype(np.float32)
    logits = X @ W + np.random.randn(n_samples, num_classes).astype(np.float32) * 0.3
    y = np.argmax(logits, axis=1)
    split = int(0.8 * n_samples)
    return (X[:split], y[:split]), (X[split:], y[split:])

class MLP(nn.Module):
    def __init__(self, input_dim=20, hidden_dim=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x):
        return self.net(x)

def train_with_scheduler(scheduler_name, train_loader, test_loader, n_epochs=100):
    torch.manual_seed(42)
    model = MLP()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Select the scheduler
    if scheduler_name == 'No Schedule':
        scheduler = None
    elif scheduler_name == 'StepLR':
        scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
    elif scheduler_name == 'CosineAnnealingLR':
        scheduler = CosineAnnealingLR(optimizer, T_max=n_epochs, eta_min=0.001)
    elif scheduler_name == 'OneCycleLR':
        scheduler = OneCycleLR(
            optimizer,
            max_lr=0.1,
            epochs=n_epochs,
            steps_per_epoch=len(train_loader)
        )
    else:
        raise ValueError(f"Unknown scheduler: {scheduler_name}")

    train_losses = []
    test_accuracies = []
    lr_history = []

    for epoch in range(n_epochs):
        # Training
        model.train()
        epoch_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

            if scheduler_name == 'OneCycleLR':
                scheduler.step()

            epoch_loss += loss.item()

        train_losses.append(epoch_loss / len(train_loader))
        lr_history.append(optimizer.param_groups[0]['lr'])

        # Update the scheduler once per epoch (all except OneCycleLR, which steps per batch)
        if scheduler is not None and scheduler_name != 'OneCycleLR':
            scheduler.step()

        # Evaluation
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for X_batch, y_batch in test_loader:
                outputs = model(X_batch)
                _, predicted = torch.max(outputs, 1)
                total += y_batch.size(0)
                correct += (predicted == y_batch).sum().item()
        test_accuracies.append(correct / total)

    return train_losses, test_accuracies, lr_history

# Prepare the data
(X_train, y_train), (X_test, y_test) = create_dataset()
train_dataset = TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
test_dataset = TensorDataset(torch.tensor(X_test), torch.tensor(y_test))
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Train with each scheduler
schedulers = ['No Schedule', 'StepLR', 'CosineAnnealingLR', 'OneCycleLR']
results = {}

for name in schedulers:
    print(f"Training with {name}...")
    train_losses, test_accs, lr_history = train_with_scheduler(name, train_loader, test_loader)
    results[name] = {
        'train_loss': train_losses,
        'test_acc': test_accs,
        'lr_history': lr_history
    }
    print(f"  Final Test Accuracy: {test_accs[-1]:.4f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

colors = {'No Schedule': 'gray', 'StepLR': 'blue', 'CosineAnnealingLR': 'green', 'OneCycleLR': 'red'}

# Learning rate
ax1 = axes[0]
for name in schedulers:
    ax1.plot(results[name]['lr_history'], label=name, color=colors[name], linewidth=1.5)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Learning Rate')
ax1.set_title('Learning Rate Schedule')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')

# Training loss
ax2 = axes[1]
for name in schedulers:
    ax2.plot(results[name]['train_loss'], label=name, color=colors[name], linewidth=1.5)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Training Loss')
ax2.set_title('Training Loss')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Test accuracy
ax3 = axes[2]
for name in schedulers:
    ax3.plot(results[name]['test_acc'], label=name, color=colors[name], linewidth=1.5)
ax3.set_xlabel('Epoch')
ax3.set_ylabel('Test Accuracy')
ax3.set_title('Test Accuracy')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

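How to choose a learning rate schedule

General recommendations

Task	Recommended schedule
Image classification (CNN)	Cosine Annealing or Step Decay
Transformer (NLP)	Warmup + Cosine Annealing
Large-scale pretraining	Warmup + Linear Decay
Fine-tuning	Cosine Annealing (small lr)
Quick experiments	OneCycleLR

Guidelines for choosing parameters

Parameter	Guideline
Initial learning rate	Search with an LR Range Test
Warmup length	5-10% of total training
Minimum learning rate	1/100 to 1/1000 of the initial learning rate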
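As a small worked example, the rules of thumb in the table can be turned into concrete numbers for a given run. `schedule_params` is a hypothetical helper, and its defaults (5% warmup, minimum rate at 1/100 of the initial rate) are just one choice within the ranges above.

```python
def schedule_params(total_steps, lr_init, warmup_frac=0.05, min_lr_ratio=0.01):
    """Derive warmup length and minimum learning rate from the guideline ranges.

    warmup_frac: fraction of training spent in warmup (guideline: 0.05 to 0.10)
    min_lr_ratio: lr_min / lr_init (guideline: 1/100 to 1/1000)
    """
    if not 0.0 <= warmup_frac < 1.0:
        raise ValueError("warmup_frac must be in [0, 1)")
    return {
        "warmup_steps": round(total_steps * warmup_frac),
        "lr_min": lr_init * min_lr_ratio,
    }

params = schedule_params(total_steps=10_000, lr_init=1e-3)
print(params)
```

For a 10,000-step run with an initial rate of 1e-3, this gives 500 warmup steps and a minimum rate of about 1e-5. These are starting points to tune from, not fixed values.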

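Summary

This article covered learning rate schedules.

Step Decay is simple, but its abrupt changes can make training unstable
Cosine Annealing decays smoothly and is widely used with Transformers
Warmup is important for stabilizing early training and is often indispensable for large models
OneCycleLR can achieve super-convergence and works well on many tasks

As a next step, the following article may also be useful.

Theory and implementation of Batch Normalization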
