Build the model

15 minutes

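Now that the waveforms are converted to spectrogram tensors, you can train a convolutional neural network (CNN). CNNs work well for spectrogram classification because a spectrogram is a two-dimensional representation with local patterns in time and frequency.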

The code in this unit uses the following objects created in the previous unit:

train_spectrogram_ds
val_spectrogram_ds
test_spectrogram_ds
label_names
get_spectrogram
BINARY_DATASET_PATH

If you run the code outside the full module flow, first run the setup and preprocessing code from the previous unit.

Inspect the model input

Before you build the model, inspect one batch to get the spectrogram input shape and the number of labels.

import pathlib

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models

for example_spectrograms, example_labels in train_spectrogram_ds.take(1):
    input_shape = example_spectrograms.shape[1:]
    break

num_labels = len(label_names)

print("Input shape:", input_shape)
print("Number of labels:", num_labels)
print("Labels:", label_names)

Expected output: The spectrogram input shape should match the shape produced in the previous unit, and the labels should be no and yes.

Input shape: (124, 129, 1)
Number of labels: 2
Labels: ['no' 'yes']
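The following sketch shows one way the (124, 129, 1) shape can arise from a one-second, 16 kHz clip. It assumes the previous unit computes the spectrogram with tf.signal.stft using frame_length=255 and frame_step=128, as the TensorFlow simple audio tutorial does. Those two values and the helper spectrogram_shape are assumptions for illustration; they don't appear in this unit.

```python
def spectrogram_shape(num_samples, frame_length, frame_step):
    """Return (time_frames, frequency_bins, channels) for a magnitude STFT."""
    # tf.signal.stft defaults fft_length to the smallest power of two
    # that is greater than or equal to frame_length.
    fft_length = 1
    while fft_length < frame_length:
        fft_length *= 2

    # Without end padding, only complete frames are kept.
    time_frames = 1 + (num_samples - frame_length) // frame_step

    # A real-valued FFT yields fft_length // 2 + 1 unique frequency bins.
    frequency_bins = fft_length // 2 + 1

    # The trailing 1 is the channel dimension added for the Conv2D layers.
    return (time_frames, frequency_bins, 1)


# ASSUMPTION: frame_length=255 and frame_step=128 are taken from the
# TensorFlow simple audio tutorial, not from this unit.
shape = spectrogram_shape(num_samples=16000, frame_length=255, frame_step=128)
print(shape)  # (124, 129, 1)
```

If your printed input shape differs, the STFT settings in your preprocessing code probably differ from the assumed values.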

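Build the model

The model starts with a resizing layer that downsamples each (124, 129, 1) spectrogram to (32, 32, 1). The smaller input makes training faster at the cost of reduced frequency and time resolution; that tradeoff is acceptable for the binary task in this module. A normalization layer comes next. Before training starts, the normalization layer learns the mean and variance of the training spectrograms through a call to adapt.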

normalization_layer = layers.Normalization()
normalization_layer.adapt(
    data=train_spectrogram_ds.map(lambda spectrogram, label: spectrogram)
)

model = models.Sequential([
    layers.Input(shape=input_shape),
    layers.Resizing(32, 32),
    normalization_layer,
    layers.Conv2D(32, 3, activation="relu"),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

model.summary()

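Expected output: The model summary lists the resizing and normalization layers, two convolutional layers, a max pooling layer, two dropout layers, a flatten layer, and two dense layers. The final dense layer has two outputs, one for each class.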
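To check the summary by hand, you can work out the output shapes and trainable parameter counts yourself. The sketch below assumes the Keras defaults used in the model code (valid padding and stride 1 for Conv2D, a 2 x 2 window for MaxPooling2D). The helper functions are hypothetical and aren't part of Keras.

```python
def conv2d_valid(shape, filters, kernel):
    """Output shape and parameter count for Conv2D with valid padding, stride 1."""
    height, width, channels = shape
    out_shape = (height - kernel + 1, width - kernel + 1, filters)
    # One kernel x kernel x channels weight block per filter, plus one bias each.
    params = kernel * kernel * channels * filters + filters
    return out_shape, params


def dense_params(units_in, units_out):
    """Parameter count for a Dense layer: weights plus biases."""
    return units_in * units_out + units_out


shape = (32, 32, 1)  # after layers.Resizing(32, 32)
shape, conv1_params = conv2d_valid(shape, filters=32, kernel=3)
shape, conv2_params = conv2d_valid(shape, filters=64, kernel=3)
shape = (shape[0] // 2, shape[1] // 2, shape[2])  # MaxPooling2D, 2 x 2 window
flat_units = shape[0] * shape[1] * shape[2]       # Flatten
dense1_params = dense_params(flat_units, 128)
dense2_params = dense_params(128, 2)              # two labels: no and yes

trainable_params = conv1_params + conv2_params + dense1_params + dense2_params
print("Shape before Flatten:", shape)             # (14, 14, 64)
print("Flattened units:", flat_units)             # 12544
print("Trainable parameters:", trainable_params)  # 1624834
```

Most of the parameters sit in the first dense layer, which is why downsampling the input to 32 x 32 has such a large effect on model size. The normalization layer's stored statistics aren't trainable, so this count excludes them.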

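Compile and train the model

Use the Adam optimizer with sparse categorical cross-entropy loss. Although the task has only two classes, the model uses two output logits, one for no and one for yes, rather than a single sigmoid output. This design matches the integer labels created by label_mode="int" in the previous unit. The model's final Dense layer outputs raw logits (no softmax activation), so set from_logits=True so that the loss function applies a numerically stable softmax internally.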

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

history = model.fit(
    train_spectrogram_ds,
    validation_data=val_spectrogram_ds,
    epochs=10,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss",
            patience=2,
            restore_best_weights=True,
        )
    ],
)

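Expected output: TensorFlow prints one line per epoch with the training loss, training accuracy, validation loss, and validation accuracy. Exact values vary with hardware and random initialization, but for this two-class problem, accuracy should improve over the first few epochs, and validation accuracy should be well above random guessing. Because of the EarlyStopping callback, training can stop before epoch 10 if validation loss fails to improve for two consecutive epochs.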
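The from_logits=True setting in the compile step matters because computing a softmax and then taking its logarithm can overflow for large logits. The sketch below is a plain Python illustration of the log-sum-exp approach for one example. It isn't TensorFlow's implementation, and the function name is hypothetical.

```python
import math


def sparse_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example, computed directly from raw logits."""
    # Subtracting the largest logit before exponentiating keeps every
    # exponent at or below zero, so math.exp can't overflow.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    # -log(softmax(logits)[label]) simplifies to this difference.
    return log_sum_exp - logits[label]


# Logit order follows label_names: index 0 is no, index 1 is yes.
print(round(sparse_crossentropy_from_logits([2.0, 0.0], label=0), 4))  # 0.1269
print(round(sparse_crossentropy_from_logits([2.0, 0.0], label=1), 4))  # 2.1269

# A naive math.exp(1000.0) would raise OverflowError; this form doesn't.
print(sparse_crossentropy_from_logits([1000.0, 0.0], label=0))  # 0.0
```

The loss is small when the largest logit belongs to the true label and grows as the model becomes confidently wrong.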

Plot the training history

Plot the loss and accuracy curves to check whether the model improved during training.

metrics = history.history

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.epoch, metrics["loss"], label="Training loss")
plt.plot(history.epoch, metrics["val_loss"], label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.epoch, metrics["accuracy"], label="Training accuracy")
plt.plot(history.epoch, metrics["val_accuracy"], label="Validation accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()

plt.show()

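Expected output: The loss plot usually trends downward, and the accuracy plot usually trends upward. If training accuracy keeps improving while validation accuracy gets worse, the model is overfitting.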
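You can also read the same signal from the numbers instead of the plots. The sketch below finds the epoch with the lowest validation loss and counts how many epochs have passed since then. The helper summarize_history is hypothetical, and the loss values are made up for illustration; they aren't results from this model.

```python
def summarize_history(metrics):
    """Find the best validation epoch and how long ago it occurred."""
    val_loss = metrics["val_loss"]
    # Index of the lowest validation loss (the first one if there's a tie).
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_epoch,
        "best_val_loss": val_loss[best_epoch],
        "epochs_since_best": len(val_loss) - 1 - best_epoch,
    }


# Hypothetical values for illustration only.
example_metrics = {
    "loss": [0.60, 0.35, 0.22, 0.15, 0.10],
    "val_loss": [0.55, 0.38, 0.30, 0.33, 0.36],
}

summary = summarize_history(example_metrics)
print(summary)
# {'best_epoch': 2, 'best_val_loss': 0.3, 'epochs_since_best': 2}
```

In this made-up run, training loss keeps falling while validation loss rises after epoch 2, which is the overfitting pattern. With patience=2, the EarlyStopping callback would stop at this point, and restore_best_weights=True would return the model to its epoch 2 weights. With real training results, pass history.history to the helper.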

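Evaluate on the test set

After training finishes, evaluate the model on the test set. The model never saw these examples during training or validation, so the result is a better estimate of how it performs on new data.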

test_metrics = model.evaluate(test_spectrogram_ds, return_dict=True)
print(test_metrics)

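Expected output: TensorFlow prints the final test loss and test accuracy. Exact values vary, but accuracy should be well above the 50% expected from random guessing on a balanced two-class dataset.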

You can also inspect the confusion matrix to see which classes the model confuses.

predicted_batches = []
true_batches = []

for spectrograms, labels in test_spectrogram_ds:
    logits = model(spectrograms, training=False)
    predicted_batches.append(tf.argmax(logits, axis=1))
    true_batches.append(labels)

predicted_labels = tf.concat(predicted_batches, axis=0)
true_labels = tf.concat(true_batches, axis=0)

confusion_matrix = tf.math.confusion_matrix(
    true_labels,
    predicted_labels,
    num_classes=num_labels,
)
print(confusion_matrix.numpy())

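Expected output: The output is a 2 x 2 matrix. Rows correspond to true labels and columns to predicted labels, so when the model classifies no and yes correctly, most counts appear on the diagonal.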
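A confusion matrix also gives you per-class precision and recall, which can reveal a model that favors one class. The sketch below uses the tf.math.confusion_matrix layout (rows are true labels, columns are predictions). The helper per_class_metrics is hypothetical, and the counts are made up for illustration.

```python
def per_class_metrics(matrix, class_names):
    """Compute precision and recall for each class from a confusion matrix."""
    results = {}
    for index, name in enumerate(class_names):
        true_positive = matrix[index][index]
        actual_total = sum(matrix[index])                    # row: true label
        predicted_total = sum(row[index] for row in matrix)  # column: prediction
        results[name] = {
            "precision": true_positive / predicted_total if predicted_total else 0.0,
            "recall": true_positive / actual_total if actual_total else 0.0,
        }
    return results


# Hypothetical counts for illustration only.
example_matrix = [
    [90, 10],  # true no: 90 predicted no, 10 predicted yes
    [5, 95],   # true yes: 5 predicted no, 95 predicted yes
]

metrics_by_class = per_class_metrics(example_matrix, ["no", "yes"])
for name, values in metrics_by_class.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

With real results, pass confusion_matrix.numpy().tolist() and list(label_names) in place of the example values.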

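Run inference on one audio file

To classify a single WAV file, apply the same preprocessing used for training: one channel, 16,000 samples, the same STFT settings, and the same channel dimension.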

def load_waveform(file_path):
    audio_binary = tf.io.read_file(str(file_path))
    waveform, sample_rate = tf.audio.decode_wav(
        audio_binary,
        desired_channels=1,
        desired_samples=16000,
    )
    waveform = tf.squeeze(waveform, axis=-1)
    return waveform, sample_rate


sample_file = next((BINARY_DATASET_PATH / "no").glob("*.wav"))

sample_waveform, sample_rate = load_waveform(sample_file)
sample_spectrogram = get_spectrogram(sample_waveform)

logits = model(sample_spectrogram[tf.newaxis, ...], training=False)
predicted_index = tf.argmax(logits[0]).numpy()
predicted_label = label_names[predicted_index]

print("Sample file:", sample_file)
print("Predicted label:", predicted_label)

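Expected output: The prediction usually matches the folder name of the selected sample file. Because this example picks a file from the no folder, the model usually predicts no. Predictions are occasionally wrong, especially if the model hasn't fully converged or if the clip is one that the model misclassified during training.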
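The model returns raw logits, so argmax gives you the label but not a confidence. A softmax turns the logits into probabilities. The sketch below uses plain Python and made-up logit values. It also shows that with two classes, the softmax probability equals a sigmoid of the logit difference, which is why the two-logit design carries the same information as a single sigmoid output.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exponentials = [math.exp(z - max_logit) for z in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits for illustration; index 0 is no, index 1 is yes.
example_logits = [1.5, -0.5]
probabilities = softmax(example_logits)

print("P(no): ", round(probabilities[0], 4))
print("P(yes):", round(probabilities[1], 4))

# Two-class softmax matches a sigmoid of the difference between the logits.
print(round(sigmoid(example_logits[0] - example_logits[1]), 4))
```

In the TensorFlow code, tf.nn.softmax(logits[0]) performs the same conversion on the model output.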

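Optional: Test your own voice

You can test the model with your own WAV files. Record short clips of yourself saying "yes" and "no." Keep each clip close to one second long and minimize background noise. Export the files as 16 kHz, mono, 16-bit PCM WAV files so that they match the training data and can be decoded by tf.audio.decode_wav. If your recording tool exports at a different sample rate, resample the file to 16 kHz before you use this code. The desired_samples=16000 argument pads or crops the samples; it doesn't convert a 44.1 kHz or 48 kHz recording to 16 kHz audio. Update the paths in custom_files to match the files you create.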

def load_voice_sample(file_path):
    audio_binary = tf.io.read_file(str(file_path))
    waveform, sample_rate = tf.audio.decode_wav(
        audio_binary,
        desired_channels=1,
        desired_samples=16000,
    )

    if int(sample_rate.numpy()) != 16000:
        raise ValueError("Use a 16 kHz WAV file, or resample the audio to 16 kHz before inference.")

    waveform = tf.squeeze(waveform, axis=-1)
    return waveform


custom_files = {
    "no": pathlib.Path("data/myvoice/no.wav"),
    "yes": pathlib.Path("data/myvoice/yes.wav"),
}

missing_files = [file_path for file_path in custom_files.values() if not file_path.exists()]

if missing_files:
    print("Create these WAV files before running the optional custom-voice example:")
    for file_path in missing_files:
        print(file_path)
else:
    for expected_label, file_path in custom_files.items():
        waveform = load_voice_sample(file_path)
        spectrogram = get_spectrogram(waveform)
        logits = model(spectrogram[tf.newaxis, ...], training=False)
        predicted_label = label_names[tf.argmax(logits[0]).numpy()]

        print(f"Expected: {expected_label}; predicted: {predicted_label}")

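Expected output: If the files don't exist yet, the code prints the paths to create. After you create the files, the code prints one prediction for each custom file. Accuracy on your own voice might be lower than on the test set, because the model was trained on the Speech Commands recordings, not on your recording environment or microphone.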
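To see why desired_samples=16000 can't stand in for resampling, consider what padding or cropping does to a clip recorded at a higher rate. The sketch below is a simplified plain Python model of that behavior, not the TensorFlow implementation. The helper names are hypothetical, and padding at the end of the clip is an assumption.

```python
def pad_or_crop(samples, desired_samples):
    """Truncate long clips and zero-pad short ones to a fixed length."""
    if len(samples) >= desired_samples:
        return samples[:desired_samples]
    return samples + [0.0] * (desired_samples - len(samples))


def seconds_kept(original_rate, desired_samples):
    """Duration of audio that survives cropping at the original sample rate."""
    return desired_samples / original_rate


# A one-second clip recorded at 48 kHz contains 48,000 samples.
one_second_at_48k = [0.0] * 48000
cropped = pad_or_crop(one_second_at_48k, 16000)

print(len(cropped))                          # 16000
print(round(seconds_kept(48000, 16000), 3))  # 0.333
```

The cropped clip has the expected length, but it holds only the first third of a second of audio, still sampled at 48 kHz. The spectrogram would then map frequencies incorrectly, which is why load_voice_sample raises an error for any sample rate other than 16 kHz.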
