Preprocessing of textual data


A variety of tools handle tokenization. However, for SMEs (business domain), their common weakness is recognizing prices as tokens. For example, "200$" is usually split into two separate tokens.
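One way to keep a price together as a single token is a regular expression that tries price patterns before ordinary words and numbers. The following is only a minimal sketch, not the tokenizer used for the results described here. The names TOKEN_PATTERN and tokenize, the set of currency symbols, and the optional two-letter prefix (as in "US$") are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: price alternatives are listed first so that they
# take precedence over the generic word and punctuation alternatives.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Z]{2})?[$€£]\d+(?:[.,]\d+)*   # currency symbol before the amount
  | \d+(?:[.,]\d+)*[$€£]               # currency symbol after the amount
  | \w+(?:'\w+)*                       # words and plain numbers, with inner apostrophes
  | [^\w\s]                            # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, keeping each price as one token."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("It costs 200$."))
    print(tokenize("sold for US$795."))
```

Because the alternatives are tried from left to right, a string such as "200$" matches a price alternative before the generic word alternative can consume "200" on its own.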

The figure below illustrates example output of preprocessing that combines a tokenizer (which identifies each price as one token) with filtering (i.e. stopword removal). The original text sample is "While a student at the University of Texas at Austin in 1984, Michael Dell founded the company as PC's Limited with capital of $1000. In 1985, the company produced the first computer of its own design, the Turbo PC, sold for US$795." (taken from



Filtering ("stopwords" removal) 

Examples of stopword lists: English, German, Spanish, Italian, French
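Filtering itself amounts to dropping every token that appears in the stopword list for the language in question. The sketch below assumes a tiny, hand-picked English list (STOPWORDS_EN) and a hypothetical helper remove_stopwords. A real list would be much longer and would be loaded from one of the per-language lists.

```python
# Illustrative subset only; not a complete English stopword list.
STOPWORDS_EN = {
    "a", "as", "at", "for", "in", "its", "of", "own", "the", "while", "with",
}


def remove_stopwords(tokens, stopwords=STOPWORDS_EN):
    """Return the tokens that are not stopwords (case-insensitive match)."""
    return [token for token in tokens if token.lower() not in stopwords]


if __name__ == "__main__":
    sample = ["While", "a", "student", "at", "the", "University", "of", "Texas"]
    print(remove_stopwords(sample))
```

Tokens are lowercased only for the comparison, so the surviving tokens keep their original capitalization, and price tokens such as "$1000" pass through unchanged.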