The nuances of base64 encoding strings in JavaScript

Matt Joseph

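Base64 encoding and decoding is a common way to transform binary content into web-safe text. It is commonly used for data URLs, such as inline images.

What happens when you apply base64 encoding and decoding to strings in JavaScript? This post explores the nuances and the common pitfalls to avoid.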

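btoa() and atob()

The core functions for base64 encoding and decoding in JavaScript are btoa() and atob(). btoa() converts a string to a base64-encoded string, and atob() decodes it back.

The following shows a quick example: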

// A really plain string that is just code points below 128.
const asciiString = 'hello';

// This will work. It will print:
// Encoded string: [aGVsbG8=]
const asciiStringEncoded = btoa(asciiString);
console.log(`Encoded string: [${asciiStringEncoded}]`);

// This will work. It will print:
// Decoded string: [hello]
const asciiStringDecoded = atob(asciiStringEncoded);
console.log(`Decoded string: [${asciiStringDecoded}]`);

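Unfortunately, as the MDN docs note, this only works with strings that contain ASCII characters, or characters that can be represented by a single byte. In other words, it does not work with arbitrary Unicode text.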
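As a rough illustration of where that boundary lies, the sketch below checks whether every UTF-16 code unit in a string fits in one byte before calling btoa(). The helper name isBtoaSafe is hypothetical and is not part of any browser API.

```javascript
// Hypothetical helper (not a built-in): returns true when every UTF-16 code
// unit in the string is at most 0xFF, which is what btoa() requires.
function isBtoaSafe(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) {
      return false;
    }
  }
  return true;
}

console.log(isBtoaSafe('hello'));  // true
console.log(isBtoaSafe('\u00f1')); // true: 'ñ' is code point 241, which fits in one byte
console.log(isBtoaSafe('\u26f3')); // false: '⛳' is code point 9971

// btoa() accepts 'ñ', but it encodes the single byte 0xF1 rather than UTF-8 bytes.
console.log(btoa('\u00f1')); // 8Q==
```

Note that passing this check only means btoa() will not throw. It does not mean the result is the base64 form of the string's UTF-8 bytes.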

To see what happens, try the following code:

// Sample string that represents a combination of small, medium, and large code points.
// This sample string is valid UTF-16.
// 'hello' has code points that are each below 128.
// '⛳' is a single 16-bit code unit.
// '❤️' is two 16-bit code units, U+2764 and U+FE0F (a heart and a variant).
// '🧀' is a 32-bit code point (U+1F9C0), which can also be represented as the surrogate pair of two 16-bit code units '\ud83e\uddc0'.
const validUTF16String = 'hello⛳❤️🧀';

// This will not work. It will print:
// DOMException: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.
try {
  const validUTF16StringEncoded = btoa(validUTF16String);
  console.log(`Encoded string: [${validUTF16StringEncoded}]`);
} catch (error) {
  console.log(error);
}

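Any one of the emojis in the string causes an error. Why does Unicode cause this problem?

To understand, let's step back and look at strings, both in computer science and in JavaScript.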

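Strings in Unicode and JavaScript

Unicode is the current global standard for character encoding, or the practice of assigning a number to a specific character so that it can be used in computer systems. For a deeper dive into Unicode, see this W3C article.

Some examples of characters in Unicode and their associated numbers: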

h - 104
ñ - 241
❤ - 10084 (U+2764 in hexadecimal)
❤️ - 10084, with a hidden modifier numbered 65039
⛳ - 9971
🧀 - 129472

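The numbers representing each character are called "code points". You can think of a code point as an address for each character. The red heart emoji actually contains two code points: one for a heart, and one to "vary" the color and make it always red.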
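To inspect code points yourself, a minimal sketch such as the following can help. The helper name codePointsOf is hypothetical. The strings are written with escape sequences so that the exact characters are unambiguous, and the values are printed in decimal (U+2764 is 10084).

```javascript
// Hypothetical helper: list the code points in a string.
// Array.from() iterates a string by code point, not by 16-bit code unit.
function codePointsOf(str) {
  return Array.from(str, (char) => char.codePointAt(0));
}

console.log(codePointsOf('h'));            // [ 104 ]
console.log(codePointsOf('\u00f1'));       // [ 241 ]          'ñ'
console.log(codePointsOf('\u2764\ufe0f')); // [ 10084, 65039 ] heart plus variation selector
console.log(codePointsOf('\u26f3'));       // [ 9971 ]         '⛳'
console.log(codePointsOf('\u{1F9C0}'));    // [ 129472 ]       '🧀'
```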

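Unicode has two common ways of turning these code points into sequences of bytes that computers can consistently interpret: UTF-8 and UTF-16.

An oversimplified view is this: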

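In UTF-8, a code point can use between one and four bytes (8 bits per byte).
In UTF-16, a code unit is always two bytes (16 bits). Most code points fit in a single code unit, and the rest need two.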
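The difference can be measured directly. The sketch below is illustrative only: it uses TextEncoder to count UTF-8 bytes and the string's length property to count UTF-16 code units. The helper name sizes is hypothetical.

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: report how much space a string takes in each encoding.
function sizes(str) {
  return {
    utf8Bytes: encoder.encode(str).length, // TextEncoder always produces UTF-8
    utf16CodeUnits: str.length,            // length counts 16-bit code units
  };
}

console.log(sizes('h'));         // { utf8Bytes: 1, utf16CodeUnits: 1 }
console.log(sizes('\u00f1'));    // { utf8Bytes: 2, utf16CodeUnits: 1 }  'ñ'
console.log(sizes('\u26f3'));    // { utf8Bytes: 3, utf16CodeUnits: 1 }  '⛳'
console.log(sizes('\u{1F9C0}')); // { utf8Bytes: 4, utf16CodeUnits: 2 }  '🧀'
```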

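Importantly, JavaScript processes strings as UTF-16. This breaks functions like btoa(), which effectively operate on the assumption that each character in the string maps to a single byte. MDN states this explicitly:

The btoa() method creates a Base64-encoded ASCII string from a binary string (i.e., a string in which each character in the string is treated as a byte of binary data).

Now that you know that characters in JavaScript often require more than one byte, the next section demonstrates how to handle this case for base64 encoding and decoding.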

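btoa() and atob() with Unicode

As you now know, the error is thrown because the string contains characters that do not fit in a single byte in UTF-16.

Fortunately, the MDN article on base64 includes some helpful sample code for solving this "Unicode problem". You can modify that code to work with the preceding example: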

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function bytesToBase64(bytes) {
  const binString = String.fromCodePoint(...bytes);
  return btoa(binString);
}

// Sample string that represents a combination of small, medium, and large code points.
// This sample string is valid UTF-16.
// 'hello' has code points that are each below 128.
// '⛳' is a single 16-bit code unit.
// '❤️' is two 16-bit code units, U+2764 and U+FE0F (a heart and a variant).
// '🧀' is a 32-bit code point (U+1F9C0), which can also be represented as the surrogate pair of two 16-bit code units '\ud83e\uddc0'.
const validUTF16String = 'hello⛳❤️🧀';

// This will work. It will print:
// Encoded string: [aGVsbG/im7PinaTvuI/wn6eA]
const validUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(validUTF16String));
console.log(`Encoded string: [${validUTF16StringEncoded}]`);

// This will work. It will print:
// Decoded string: [hello⛳❤️🧀]
const validUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(validUTF16StringEncoded));
console.log(`Decoded string: [${validUTF16StringDecoded}]`);

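The following steps explain what this code does to encode the string:

Use the TextEncoder interface to take the UTF-16 encoded JavaScript string and convert it to a stream of UTF-8 encoded bytes using TextEncoder.encode().
This returns a Uint8Array, which is a less commonly used data type in JavaScript and is a subclass of TypedArray.
Take that Uint8Array and provide it to the bytesToBase64() function, which uses String.fromCodePoint() to treat each byte in the Uint8Array as a code point and create a string from it. The result is a string of code points that can each be represented as a single byte.
Take that string and use btoa() to base64 encode it.

The decoding process is the same thing, but in reverse.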

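This works because the step between the Uint8Array and the string guarantees that, although JavaScript represents the string as a two-byte UTF-16 encoding, the code point that each two-byte unit represents is always less than 256, so it fits in a single byte.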
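The steps above can be traced one stage at a time. The following is only a sketch for a single character ('⛳', written as \u26f3), with each intermediate value printed so that the byte-sized code points are visible. The variable names are illustrative.

```javascript
// Encoding, stage by stage.
const utf8Bytes = new TextEncoder().encode('\u26f3'); // '⛳'
console.log(Array.from(utf8Bytes)); // [ 226, 155, 179 ]

// Treat each byte as a code point; the result is a "binary string".
const binString = String.fromCodePoint(...utf8Bytes);
console.log(binString.length); // 3
console.log(Array.from(binString).every((c) => c.codePointAt(0) < 256)); // true

const encoded = btoa(binString);
console.log(encoded); // 4puz

// Decoding is the same path in reverse.
const decodedBinString = atob(encoded);
const decodedBytes = Uint8Array.from(decodedBinString, (m) => m.codePointAt(0));
const decoded = new TextDecoder().decode(decodedBytes);
console.log(decoded === '\u26f3'); // true
```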

This code works well under most circumstances, but it silently fails in others.

Silent failure case

Use the same code, but with a different string:

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function bytesToBase64(bytes) {
  const binString = String.fromCodePoint(...bytes);
  return btoa(binString);
}

// Sample string that represents a combination of small, medium, and large code points.
// This sample string is invalid UTF-16.
// 'hello' has code points that are each below 128.
// '⛳' is a single 16-bit code unit.
// '❤️' is two 16-bit code units, U+2764 and U+FE0F (a heart and a variant).
// '🧀' is a 32-bit code point (U+1F9C0), which can also be represented as the surrogate pair of two 16-bit code units '\ud83e\uddc0'.
// '\uDE75' is a code unit that is one half of a surrogate pair.
const partiallyInvalidUTF16String = 'hello⛳❤️🧀\uDE75';

// This will work. It will print:
// Encoded string: [aGVsbG/im7PinaTvuI/wn6eA77+9]
const partiallyInvalidUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(partiallyInvalidUTF16String));
console.log(`Encoded string: [${partiallyInvalidUTF16StringEncoded}]`);

// This will work. It will print:
// Decoded string: [hello⛳❤️🧀�]
const partiallyInvalidUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(partiallyInvalidUTF16StringEncoded));
console.log(`Decoded string: [${partiallyInvalidUTF16StringDecoded}]`);

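If you take that last character after decoding (�) and check its hex value, you will find that it is \uFFFD rather than the original \uDE75. Nothing fails or throws an error, but the input and output data have silently changed. Why?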
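Before answering, the hex check described above can be sketched as follows. This is a simplified version that skips the base64 step, on the assumption that base64 itself is lossless, and uses a shorter sample string. The helper name lastCodeUnitHex is hypothetical.

```javascript
// A string that ends with a lone surrogate.
const original = 'hello\uDE75';

// Round trip through UTF-8 bytes, as the earlier code does around base64.
const roundTripBytes = new TextEncoder().encode(original);
const roundTripped = new TextDecoder().decode(roundTripBytes);

// Hypothetical helper: hex value of the last UTF-16 code unit in a string.
function lastCodeUnitHex(str) {
  return str.charCodeAt(str.length - 1).toString(16);
}

console.log(lastCodeUnitHex(original));     // de75
console.log(lastCodeUnitHex(roundTripped)); // fffd
console.log(original === roundTripped);     // false, and no error was thrown

// The last three bytes are the UTF-8 form of U+FFFD.
console.log(Array.from(roundTripBytes.slice(-3), (b) => b.toString(16))); // [ 'ef', 'bf', 'bd' ]
```

In this sketch the encoded bytes already end in EF BF BD, so the substitution is visible before TextDecoder runs. That matches the trailing 77+9 in the encoded output shown earlier.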

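Strings vary by JavaScript API

As described previously, JavaScript processes strings as UTF-16. But UTF-16 strings have a unique property.

Take the cheese emoji as an example. The emoji (🧀) has a Unicode code point of 129472. Unfortunately, the maximum value for a 16-bit number is 65535! So how does UTF-16 represent this much higher number?

UTF-16 has a concept called surrogate pairs. You can think of it this way:

The first number in the pair specifies which "book" to search in. This is called a "surrogate".
The second number in the pair is the entry in the "book".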
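To make the "book" and "entry" analogy concrete, the sketch below splits the cheese emoji into its two code units and recombines them with the standard UTF-16 surrogate arithmetic. The variable names are illustrative.

```javascript
const cheese = '\u{1F9C0}'; // '🧀'

// One code point, but two 16-bit code units.
console.log(cheese.length); // 2

const high = cheese.charCodeAt(0); // the "book"
const low = cheese.charCodeAt(1);  // the entry in the "book"
console.log(high.toString(16), low.toString(16)); // d83e ddc0

// Recombine: high surrogates start at 0xD800, low surrogates at 0xDC00, and
// the pair addresses code points from 0x10000 upward.
const combined = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
console.log(combined);                          // 129472
console.log(combined === cheese.codePointAt(0)); // true
```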

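As you might imagine, it can be problematic to have only the number representing the book, but not the actual entry in that book. In UTF-16, this is known as a lone surrogate.

This is particularly challenging in JavaScript, because some APIs work despite lone surrogates while others fail.

In this case, you are using TextDecoder when decoding back from base64. In particular, the documentation for TextDecoder's fatal option specifies the following:

It defaults to false, which means that the decoder substitutes malformed data with a replacement character.

The � character you observed earlier, represented as \uFFFD in hex, is that replacement character. In UTF-16, strings with lone surrogates are considered "malformed" or "not well formed".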
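The effect of that default can be seen by decoding bytes that are not valid UTF-8. The following sketch compares a default TextDecoder with one constructed with fatal set to true. The sample bytes are made up for illustration: 0xFF never appears in valid UTF-8.

```javascript
// 'h', 'i', then a byte that is never valid in UTF-8.
const malformedBytes = new Uint8Array([0x68, 0x69, 0xff]);

// Default decoder: fatal is false, so malformed data becomes U+FFFD.
const lenientDecoder = new TextDecoder();
const lenientResult = lenientDecoder.decode(malformedBytes);
console.log(lenientResult === 'hi\uFFFD'); // true

// Strict decoder: fatal is true, so malformed data throws a TypeError.
const strictDecoder = new TextDecoder('utf-8', { fatal: true });
let strictError = null;
try {
  strictDecoder.decode(malformedBytes);
} catch (error) {
  strictError = error;
}
console.log(strictError instanceof TypeError); // true
```

A strict decoder only guards the bytes-to-string direction. It does not detect a lone surrogate in a string before that string is encoded.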

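There are various web standards (examples 1, 2, 3, 4) that specify exactly when a malformed string affects API behavior, and TextDecoder is notably one of those APIs. It is good practice to make sure strings are well formed before doing text processing.

Check for well-formed strings

Very recent browsers now have a function for this purpose: isWellFormed().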

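You can achieve a similar outcome by using encodeURIComponent(), which throws a URIError if the string contains a lone surrogate.

The following function uses isWellFormed() if it is available and encodeURIComponent() if it is not. Similar code could be used to create a polyfill for isWellFormed().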

// Quick polyfill since older browsers do not support isWellFormed().
// encodeURIComponent() throws an error for lone surrogates, which is essentially the same.
function isWellFormed(str) {
  if (typeof str.isWellFormed === 'function') {
    // Use the newer isWellFormed() feature.
    return str.isWellFormed();
  } else {
    // Use the older encodeURIComponent().
    try {
      encodeURIComponent(str);
      return true;
    } catch (error) {
      return false;
    }
  }
}

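Put it all together

Now that you know how to handle both Unicode and lone surrogates, you can put everything together to create code that handles all cases and does so without silent text replacement.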

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function bytesToBase64(bytes) {
  const binString = String.fromCodePoint(...bytes);
  return btoa(binString);
}

// Quick polyfill since older browsers do not support isWellFormed().
// encodeURIComponent() throws an error for lone surrogates, which is essentially the same.
function isWellFormed(str) {
  if (typeof str.isWellFormed === 'function') {
    // Use the newer isWellFormed() feature.
    return str.isWellFormed();
  } else {
    // Use the older encodeURIComponent().
    try {
      encodeURIComponent(str);
      return true;
    } catch (error) {
      return false;
    }
  }
}

const validUTF16String = 'hello⛳❤️🧀';
const partiallyInvalidUTF16String = 'hello⛳❤️🧀\uDE75';

if (isWellFormed(validUTF16String)) {
  // This will work. It will print:
  // Encoded string: [aGVsbG/im7PinaTvuI/wn6eA]
  const validUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(validUTF16String));
  console.log(`Encoded string: [${validUTF16StringEncoded}]`);

  // This will work. It will print:
  // Decoded string: [hello⛳❤️🧀]
  const validUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(validUTF16StringEncoded));
  console.log(`Decoded string: [${validUTF16StringDecoded}]`);
} else {
  // Not reached in this example.
}

if (isWellFormed(partiallyInvalidUTF16String)) {
  // Not reached in this example.
} else {
  // This is not a well-formed string, so we handle that case.
  console.log(`Cannot process a string with lone surrogates: [${partiallyInvalidUTF16String}]`);
}

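There are many optimizations that can be made to this code, such as generalizing it into a polyfill, changing the TextDecoder parameters to throw instead of silently replacing lone surrogates, and more.

With this knowledge and code, you can also make explicit decisions about how to handle malformed strings, such as rejecting the data, explicitly enabling replacement, or throwing an error for later analysis.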
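One way to make that decision explicit is to pass it in as an option. The sketch below is one possible shape, not a definitive implementation: the function name stringToBase64 and the onMalformed option are hypothetical, and bytesToBase64() is written with Array.from() and join() here as an assumption to avoid spreading a large array into function arguments.

```javascript
// Same idea as the isWellFormed() helper shown earlier.
function isWellFormed(str) {
  if (typeof str.isWellFormed === 'function') {
    return str.isWellFormed();
  }
  try {
    encodeURIComponent(str);
    return true;
  } catch (error) {
    return false;
  }
}

function bytesToBase64(bytes) {
  const binString = Array.from(bytes, (byte) => String.fromCodePoint(byte)).join('');
  return btoa(binString);
}

// Hypothetical wrapper: onMalformed is 'throw' (reject the data) or
// 'replace' (knowingly accept U+FFFD in place of lone surrogates).
function stringToBase64(str, { onMalformed = 'throw' } = {}) {
  if (!isWellFormed(str) && onMalformed !== 'replace') {
    throw new RangeError('Cannot encode a string with lone surrogates.');
  }
  // TextEncoder substitutes U+FFFD for lone surrogates, so 'replace' needs
  // no extra work beyond allowing the call to proceed.
  return bytesToBase64(new TextEncoder().encode(str));
}

console.log(stringToBase64('hello')); // aGVsbG8=
console.log(stringToBase64('hello\uDE75', { onMalformed: 'replace' })); // aGVsbG/vv70=

try {
  stringToBase64('hello\uDE75');
} catch (error) {
  console.log(error.message); // Cannot encode a string with lone surrogates.
}
```

Defaulting to 'throw' means replacement only happens when the caller asks for it, which keeps the silent case from reappearing.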

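In addition to being a useful example of base64 encoding and decoding, this post shows why careful text processing is especially important, particularly when the text data comes from user-generated or external sources.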