
14.1. Word Segmentation

Consider a Chinese text that translates the phrase "float like a butterfly." It consists of five characters, but there are no spaces between them, so a Chinese reader must perform the task of word segmentation: deciding where the word boundaries are. Readers of English don't normally perform this task, because we have spaces between words. However, some texts, such as URLs, don't have spaces, and sometimes writers make mistakes and leave a space out; how could a search engine or word processing program correct such a mistake?

Consider the English text "choosespain.com." This is a website hoping to convince you to choose Spain as a travel destination, but if you segment the name wrong, you get the less appealing name "chooses pain." Human readers are able to make the right choice by drawing upon years of experience; surely it would be an insurmountable task to encode that experience into a computer algorithm. Yet we can take a shortcut that works surprisingly well: look up each phrase in the bigram table. We see that "choose Spain" has a count of 3,210, whereas "chooses pain" does not appear in the table at all (which means it occurs fewer than 40 times in the trillion-word corpus). Thus "choose Spain" is at least 80 times more likely, and can be safely considered the right segmentation.
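The lookup described above can be sketched in a few lines of Python. This is only an illustration, not the book's implementation: the table here is a hypothetical one-entry excerpt holding the single count quoted in the text (3,210 for "choose Spain"), and the function names are invented for the example. The 40-occurrence cutoff is the one stated above, so the ratio computed is a lower bound.

```python
# Hypothetical one-entry excerpt of a bigram count table. Only the
# "choose spain" count (3,210) comes from the text; a real table would
# hold many more entries.
BIGRAM_COUNTS = {"choose spain": 3210}

# Per the text, bigrams occurring fewer than 40 times are not in the table.
MIN_TABLE_COUNT = 40


def candidate_splits(text):
    """Yield every way to split text into two non-empty pieces."""
    for i in range(1, len(text)):
        yield text[:i], text[i:]


def best_two_word_split(text, counts=BIGRAM_COUNTS):
    """Return the two-word split with the highest bigram count.

    Returns None when no candidate split appears in the table.
    """
    best = max(
        candidate_splits(text),
        key=lambda pair: counts.get(" ".join(pair), 0),
        default=None,  # text shorter than two characters has no split
    )
    if best is None or counts.get(" ".join(best), 0) == 0:
        return None
    return best


def min_likelihood_ratio(seen_count, threshold=MIN_TABLE_COUNT):
    """Lower bound on how much more frequent a listed bigram is than
    any bigram missing from the table (whose count is below threshold)."""
    return seen_count / threshold


print(best_two_word_split("choosespain"))
print(min_likelihood_ratio(3210))
```

With this table, "choosespain" resolves to "choose" + "spain", and the ratio 3,210 / 40 is where the "at least 80 times more likely" figure comes from. The sketch only considers two-word splits, which is enough for this example but not for longer unspaced strings.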


  
