How can TensorFlow be used to segment the word code points of a ragged tensor back into sentences?

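The word code points of a ragged tensor can be segmented as follows. Segmentation means splitting text into word-like units. Where space characters separate words this is straightforward, but some languages, such as Chinese and Japanese, do not use spaces. Others, such as German, contain long compound words that must be split before their meaning can be analyzed.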

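The word code points are segmented back into sentences. tf.RaggedTensor.from_row_lengths groups consecutive words into sentences using the number of words in each sentence, which yields a ragged tensor with one row per sentence and one list of code points per word. That ragged tensor is then encoded back to UTF-8.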
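To make the grouping step concrete, here is a plain-Python analogy of what grouping by row lengths does. It is only an illustrative sketch: `group_by_row_lengths` is a hypothetical helper written for this example, not a TensorFlow function, and the code points and word counts are copied from the output shown later in this article.

```python
def group_by_row_lengths(values, row_lengths):
    """Plain-Python analogy: split `values` into consecutive rows of the given lengths."""
    if sum(row_lengths) != len(values):
        raise ValueError("row_lengths must sum to len(values)")
    rows, start = [], 0
    for n in row_lengths:
        rows.append(values[start:start + n])
        start += n
    return rows


# One list of code points per word, across both sentences.
words = [
    [72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46],
    [19990, 30028], [12371, 12435, 12395, 12385, 12399],
]

# The first sentence has 4 words, the second has 2.
sentences = group_by_row_lengths(words, [4, 2])
print(sentences)
```

The first four words land in the first sentence and the remaining two in the second, which mirrors the nesting of the ragged tensor.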

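Let us see how to represent Unicode strings in Python and manipulate them with the Unicode equivalents of standard string operations. First, using those Unicode equivalents, the Unicode strings are split into tokens based on script detection.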
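As a rough illustration of script detection, the sketch below decodes two sample sentences and looks up a script identifier for every character. The sample sentences are an assumption inferred from the output shown later in this article. A change of script between neighbouring characters is what this approach treats as a token boundary.

```python
import tensorflow as tf

# Sample sentences (assumed; inferred from the output later in the article).
sentence_texts = ['Hello, there.', '世界こんにちは']

# Decode each UTF-8 string into a row of Unicode code points (a RaggedTensor).
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')

# Look up a script identifier for every code point. Letters, punctuation,
# Han characters and Hiragana fall into different scripts.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script.to_list())
```

In the first sentence the script changes between "o" and ",", so "Hello" and ", " become separate tokens. In the second, the change from Han to Hiragana separates the two words.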

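We use Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser with no configuration required and gives free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.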

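import tensorflow as tf

# word_char_codepoint (code points per word) and sentence_num_words (words per
# sentence) are assumed to come from the earlier segmentation steps.
print("Segment the word code points back to sentences")
print("Check if code point for a character in a word is present in the sentence")
# Group consecutive words into sentences using the per-sentence word counts.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)
print("Encoding it back to UTF-8")
# Encode each word's code points as UTF-8 bytes; the notebook displays the list.
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()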

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

Segment the word code points back to sentences
Check if code point for a character in a word is present in the sentence
<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>
Encoding it back to UTF-8
[[b'Hello', b', ', b'there', b'.'],
[b'\xe4\xb8\x96\xe7\x95\x8c',
   b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
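The snippet above uses word_char_codepoint and sentence_num_words without defining them. The following is a self-contained sketch of one way to produce them, assuming script-based segmentation along the lines of the TensorFlow Unicode tutorial credited above, and assuming the input sentences 'Hello, there.' and '世界こんにちは' (inferred from the output, not stated in this article).

```python
import tensorflow as tf

# Assumed input sentences, inferred from the output above.
sentence_texts = ['Hello, there.', '世界こんにちは']

# Code points and script identifiers for every character in every sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)

# A character starts a word if it is first in its sentence or if its script
# differs from the script of the previous character.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)

# Flat indices (across all sentences) of the characters that start a word.
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)

# One row of code points per word, over the flattened character sequence.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)

# Number of words in each sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64), axis=1)

# Group the words back into sentences, then encode each word as UTF-8.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
words_utf8 = tf.strings.unicode_encode(
    sentence_word_char_codepoint, 'UTF-8').to_list()
print(words_utf8)
```

With these assumed inputs, the word counts are 4 and 2, which is what lets from_row_lengths place the first four words in the first sentence and the last two in the second.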

Explanation

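The word code points are segmented back into sentences.
The number of words in each sentence determines which words belong to which sentence.
The decoded data is encoded back to UTF-8.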
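The escaped byte strings in the output can be hard to read. As a quick check that needs no TensorFlow, this plain-Python sketch rebuilds the same bytes from the code points printed above and decodes them into readable text.

```python
# Code points copied from the ragged tensor printed in the output.
sentences = [
    [[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46]],
    [[19990, 30028], [12371, 12435, 12395, 12385, 12399]],
]

# Turn each word's code points into a string, then into UTF-8 bytes.
encoded = [
    [''.join(chr(cp) for cp in word).encode('utf-8') for word in sentence]
    for sentence in sentences
]
print(encoded)

# Decode the bytes again to see the words as text.
decoded = [[word.decode('utf-8') for word in sentence] for sentence in encoded]
print(decoded)
```

The second sentence decodes to the words 世界 and こんにちは, which confirms that the escaped bytes in the output are just the UTF-8 form of those words.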

AmitDiwan

Updated on: 20 February 2021
