Next: Setting characters Up: charRecognizer Previous: charRecognizer   Contents   Index


Using UNICODE©

The following section is quoted from a document by Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>, which can be found at http://www.cl.cam.ac.uk/~mgk25/unicode.html. It's a very good document, and you should read it.

What is UNICODE?

Historically, there have been two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized around 1991 that two different unified character sets is not what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions. Unicode 1.1 corresponds to ISO 10646-1:1993 and Unicode 3.0 corresponds to ISO 10646-1:2000.
In GOCR, we adopted the Unicode Standard version 3.0. For the programmer using GOCR, this is a very simple way to deal with characters that are not in the ASCII or ISO-8859-1 tables, and it makes it possible to support any language.

Support in GOCR is very simple, as it should be. The header unicode.h contains a list of #defines for some of the characters. Note that only a small portion of the Unicode set is present there, which reflects what we already recognize and what we hope to be able to recognize in the near future. If you need to support other characters not found there, feel free to add them. Be sure to use their correct codes; you can get a full list of them at:

http://www.unicode.org
and if you notify us, we will add them to the header. Since GOCR treats the codes as plain numbers, it doesn't matter whether a character is in the header or not. The only problem you may encounter is with the outputFormatter plugin, which may not support some characters.
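As a sketch of what adding your own characters might look like, the block below defines two extra codes in the same #define style. The macro names and the helper is_extra_char are hypothetical and are not taken from unicode.h; only the code values, which come from the Unicode standard, are fixed.

```c
#include <wchar.h>

/* Hypothetical names, written in the style of a character list header.
 * The values are the official Unicode code points. */
#define MY_EURO_SIGN                        0x20AC /* U+20AC EURO SIGN */
#define MY_LATIN_SMALL_LETTER_S_WITH_CARON  0x0161 /* U+0161 */

/* Hypothetical helper: nonzero if c is one of the extra codes above.
 * The codes are compared as plain numbers, nothing more. */
int is_extra_char(wchar_t c)
{
    return c == MY_EURO_SIGN || c == MY_LATIN_SMALL_LETTER_S_WITH_CARON;
}
```

Both codes are below 0xFFFF, so this also works where wchar_t is only 16 bits wide.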

In short, GOCR uses the UCS-4 encoding internally. This is much easier for the programmer to handle than UTF-8, and should not pose problems provided that you use the wcs* functions instead of the usual str* functions. The outputFormatter module can be used to export UTF-8 text or whatever else you need.
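To illustrate the difference between the two encodings, here is a minimal sketch that walks a wide string with wcslen and encodes each code as UTF-8. The functions ucs4_to_utf8 and utf8_length are hypothetical helpers, not part of GOCR, and the sketch assumes codes are limited to the range U+0000 to U+10FFFF.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not part of GOCR: encode one UCS-4 code as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if the code is a surrogate or lies above U+10FFFF. */
size_t ucs4_to_utf8(unsigned long code, unsigned char *out)
{
    if (code < 0x80UL) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800UL) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (code >> 6));
        out[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    if (code >= 0xD800UL && code <= 0xDFFFUL)  /* surrogates are not characters */
        return 0;
    if (code < 0x10000UL) {                    /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (code >> 12));
        out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (code & 0x3F));
        return 3;
    }
    if (code < 0x110000UL) {                   /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (code >> 18));
        out[1] = (unsigned char)(0x80 | ((code >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (code & 0x3F));
        return 4;
    }
    return 0;
}

/* Number of bytes a wide string would occupy in UTF-8.
 * Note the use of wcslen, not strlen, to measure the wide string. */
size_t utf8_length(const wchar_t *text)
{
    unsigned char buf[4];
    size_t total = 0;
    size_t n = wcslen(text);
    size_t i;

    for (i = 0; i < n; i++)
        total += ucs4_to_utf8((unsigned long)text[i], buf);
    return total;
}
```

With UCS-4 every character is one array element, so indexing and counting are trivial; in UTF-8 the same character may take between one and four bytes.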

The wchar_t type is used to handle wide characters. We assume that wchar_t is 32 bits wide, which is the usual default on current systems, but a 16-bit wchar_t may work as long as you don't use characters whose code is larger than 0xFFFF (65535).
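If you are unsure how wide wchar_t is on your platform, a check along the following lines may help. This is only a sketch: wchar_bits and code_fits_wchar are hypothetical names, not GOCR functions, and the test relies on the standard WCHAR_MAX macro from <wchar.h>.

```c
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper: number of bits in wchar_t on this platform. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: nonzero if the given code can be stored in a
 * wchar_t here. With a 16-bit wchar_t this is false above 0xFFFF. */
int code_fits_wchar(unsigned long code)
{
    return code <= (unsigned long)WCHAR_MAX;
}
```

Calling code_fits_wchar(0x10000UL) at start-up is one way to find out whether characters above 0xFFFF are usable.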

GOCR provides a simple function that helps to compose characters and accents:

wchar_t gocr_compose ( wchar_t main, wchar_t modifier );
Here main is the base character and modifier is the accent; the function returns the code of the accented character. For example:

character = gocr_compose( 'a', ACUTE_ACCENT );
returns the code of the character á. The function currently supports the following combinations:



Modifier             Characters
ACUTE_ACCENT         aeiouy AEIOUY
CEDILLA              c C
TILDE                ano ANO
GRAVE_ACCENT         aeiou AEIOU
DIAERESIS            aeiouy AEIOUY
CIRCUMFLEX_ACCENT    aeiou AEIOU
RING_ABOVE           a A
e or E (æ, œ)        ao AO
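To show the idea behind such a composition function, here is a minimal sketch covering only the acute accent and the cedilla. It is not GOCR's implementation: sketch_compose and the SKETCH_* constants are hypothetical, the modifier values are assumed to be the Unicode codes of the standalone accents, and returning main unchanged for unsupported pairs is likewise an assumption.

```c
#include <wchar.h>

/* Hypothetical modifier codes for this sketch (Unicode spacing accents). */
#define SKETCH_ACUTE   0x00B4  /* U+00B4 ACUTE ACCENT */
#define SKETCH_CEDILLA 0x00B8  /* U+00B8 CEDILLA */

/* Return the precomposed Latin-1 character for (main, modifier),
 * or main unchanged if the pair is not handled by this sketch. */
wchar_t sketch_compose(wchar_t main, wchar_t modifier)
{
    switch (modifier) {
    case SKETCH_ACUTE:
        switch (main) {
        case L'a': return 0x00E1;  /* á */
        case L'e': return 0x00E9;  /* é */
        case L'i': return 0x00ED;  /* í */
        case L'o': return 0x00F3;  /* ó */
        case L'u': return 0x00FA;  /* ú */
        case L'y': return 0x00FD;  /* ý */
        case L'A': return 0x00C1;  /* Á */
        case L'E': return 0x00C9;  /* É */
        case L'I': return 0x00CD;  /* Í */
        case L'O': return 0x00D3;  /* Ó */
        case L'U': return 0x00DA;  /* Ú */
        case L'Y': return 0x00DD;  /* Ý */
        default:   break;
        }
        break;
    case SKETCH_CEDILLA:
        switch (main) {
        case L'c': return 0x00E7;  /* ç */
        case L'C': return 0x00C7;  /* Ç */
        default:   break;
        }
        break;
    default:
        break;
    }
    return main;  /* assumption: unsupported pairs pass through */
}
```

A nested switch keeps each modifier's supported characters in one place, which mirrors the layout of the table above.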



Besides that, it also supports a Latin → Greek character translation if you pass 'g' as the modifier. See Table 3.1 for reference.

Table 3.1: Latin → Greek reference for gocr_compose.
Latin   Greek
a       α
b       β
g       γ
d       δ
e       ε
z       ζ
h       η
q       θ
i       ι
k       κ
l       λ
m       μ
n       ν
x       ξ
o       ο
p       π
r       ρ
s       σ
t       τ
y       υ
f       φ
c       χ
v       ψ
w       ω


If main is a capital letter, the returned character will also be a capital letter. Support for Greek accents (tonos, dialytika, etc.) is under way.
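The Latin → Greek mapping of Table 3.1, including the capital-letter rule, can be sketched with a pair of lookup tables. Again this is an illustration rather than GOCR's code: sketch_to_greek is a hypothetical name, leaving unmapped letters unchanged is an assumption, and the sketch relies on wide characters for A-Z and a-z being contiguous, as they are in ASCII-compatible encodings.

```c
#include <stddef.h>
#include <wchar.h>

/* Latin keys from Table 3.1, in Greek alphabetical order. */
static const wchar_t latin_keys[] = L"abgdezhqiklmnxoprstyfcvw";

/* Unicode codes of the matching small Greek letters (alpha to omega;
 * U+03C2, final sigma, is deliberately not in this list). */
static const unsigned short greek_small[] = {
    0x03B1, 0x03B2, 0x03B3, 0x03B4, 0x03B5, 0x03B6, 0x03B7, 0x03B8,
    0x03B9, 0x03BA, 0x03BB, 0x03BC, 0x03BD, 0x03BE, 0x03BF, 0x03C0,
    0x03C1, 0x03C3, 0x03C4, 0x03C5, 0x03C6, 0x03C7, 0x03C8, 0x03C9
};

/* Translate a Latin letter to its Greek counterpart. Capital input
 * gives a capital Greek letter, which in Unicode sits 0x20 below the
 * small one. Letters without a mapping are returned unchanged. */
wchar_t sketch_to_greek(wchar_t main)
{
    int upper = (main >= L'A' && main <= L'Z');
    wchar_t lower = upper ? (wchar_t)(main - L'A' + L'a') : main;
    size_t i;

    for (i = 0; latin_keys[i] != L'\0'; i++) {
        if (latin_keys[i] == lower)
            return (wchar_t)(greek_small[i] - (upper ? 0x20 : 0));
    }
    return main;
}
```

A table is used instead of arithmetic on the Latin code because the Latin keys do not follow the order of the Greek alphabet.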


root 2002-02-17