How can Unicode strings be split, and byte offsets be specified, with TensorFlow and Python?

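Unicode strings can be split into characters with the 'unicode_split' method, and the byte offset of each character can be obtained with the 'unicode_decode_with_offsets' method. Both are found in the 'strings' module of 'tensorflow' (tf.strings).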

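To begin, represent the Unicode strings in Python and manipulate them with Unicode-aware operations. Using the Unicode equivalents of the standard string ops, Unicode strings can also be separated into tokens based on script detection.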

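We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser with no configuration required and gives free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.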

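import tensorflow as tf
# `thanks` is undefined in the original snippet; an example value is assumed here
thanks = u"Thanks 😊".encode("UTF-8")
print("Split unicode strings")
# The result of this call is not printed, so only the heading appears in the output
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
# Decode to codepoints and get the byte offset at which each character starts
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')
print("Printing byte offset for characters")
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))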

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

Split unicode strings
Printing byte offset for characters
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
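
The output above shows only the heading for the split step, because the return value of tf.strings.unicode_split is never printed. As a minimal sketch of what a per-character split produces, the plain-Python function below mirrors the idea without TensorFlow. The name split_utf8_chars and the sample string are assumptions made for illustration; they are not part of the TensorFlow API.

```python
def split_utf8_chars(data: bytes) -> list:
    """Split UTF-8 encoded bytes into one bytes object per character."""
    text = data.decode("utf-8")
    # Re-encode each character on its own so multi-byte characters stay intact
    return [ch.encode("utf-8") for ch in text]


# Assumed sample input, chosen to mix one-byte and four-byte characters
sample = "Thanks 😊".encode("utf-8")
pieces = split_utf8_chars(sample)
print(pieces)
```

Each element is a byte string. The emoji occupies four bytes while each ASCII letter occupies one, which is why character positions and byte offsets diverge.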

Explanation

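The tf.strings.unicode_split operation splits a Unicode string into substrings of individual characters.
To align the character tensor generated by tf.strings.unicode_decode with the original string, you need to know the offset at which each character begins.
The tf.strings.unicode_decode_with_offsets method is similar to unicode_decode, except that it also returns a second tensor containing the start offset of each character, measured in bytes.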
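
To make the relationship between codepoints and byte offsets concrete, here is a plain-Python sketch that computes the same two lists for the balloon string and then uses the offsets to slice each character back out of the original bytes. decode_with_offsets is a hypothetical helper written for this illustration; it is not how TensorFlow implements the operation.

```python
def decode_with_offsets(data: bytes):
    """Return (codepoints, offsets) for UTF-8 bytes.

    Each offset is the byte index at which the corresponding character starts.
    """
    codepoints, offsets = [], []
    position = 0
    for ch in data.decode("utf-8"):
        codepoints.append(ord(ch))
        offsets.append(position)
        # Advance by the number of bytes this character takes in UTF-8
        position += len(ch.encode("utf-8"))
    return codepoints, offsets


raw = "🎈🎉🎊".encode("utf-8")
codepoints, offsets = decode_with_offsets(raw)

# Consecutive offsets act as slice boundaries into the original bytes
boundaries = offsets + [len(raw)]
chars = [raw[start:end].decode("utf-8")
         for start, end in zip(boundaries, boundaries[1:])]

for offset, codepoint, ch in zip(offsets, codepoints, chars):
    print("At byte offset {}: codepoint {} ({})".format(offset, codepoint, ch))
```

The offsets advance in steps of four because each of these emoji is encoded as four bytes in UTF-8.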

AmitDiwan

Updated on: 20 February 2021

