Bitextor is a free and open-source application that generates translation memories using multilingual websites as a corpus source. It is licensed under the GNU GPL v2. The application downloads all the HTML files from a website specified by the user. It then preprocesses them into a consistent, suitable format and, finally, applies a set of heuristics (based mainly on HTML tag structure and text block length) to pair files that are likely to contain the same text in different languages. From these candidate pairs, translation memories are generated in TMX format using the LibTagAligner library, which uses the HTML tags and the length of text chunks to perform the alignment. The goal of the tool is to provide an easy way to build a multilingual corpus from the web. It was developed to simplify the training of machine translation tools, specifically Apertium.
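The pairing heuristics described above can be illustrated with a small Python sketch. This is not Bitextor's actual code: the names `fingerprint` and `pair_score`, and the equal weighting of the two signals, are assumptions made only for illustration. It scores two HTML documents by how similar their tag sequences are and how close their total text lengths are.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _Fingerprint(HTMLParser):
    """Collects the tag sequence and the length of each non-empty text block."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_lengths = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_lengths.append(len(text))


def fingerprint(html):
    """Return (tag sequence, text block lengths) for an HTML string."""
    parser = _Fingerprint()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text_lengths


def pair_score(html_a, html_b):
    """Hypothetical score in [0, 1]; higher means a likelier translation pair."""
    tags_a, lens_a = fingerprint(html_a)
    tags_b, lens_b = fingerprint(html_b)

    # Signal 1: similarity of the HTML tag structure.
    tag_similarity = SequenceMatcher(None, tags_a, tags_b).ratio()

    # Signal 2: ratio of the total text lengths (translations tend to be
    # of comparable length).
    total_a, total_b = sum(lens_a), sum(lens_b)
    longest = max(total_a, total_b)
    length_similarity = 1.0 if longest == 0 else min(total_a, total_b) / longest

    # Assumption: both signals are weighted equally.
    return (tag_similarity + length_similarity) / 2
```

With such a score, every pair of downloaded pages in different languages could be ranked, keeping only those above a chosen threshold as candidates for alignment. A real system would also need language identification and a more careful comparison of individual text blocks.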