Quoting from a document by Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> that can be found at: http://www.cl.cam.ac.uk/mgk25/unicode.html. It's a very good document, and you should read it.
What is UNICODE?
Historically, there have been two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized around 1991 that two different unified character sets is not what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions. Unicode 1.1 corresponds to ISO 10646-1:1993 and Unicode 3.0 corresponds to ISO 10646-1:2000.
In GOCR, we adopted the Unicode Standard version 3.0. For the programmer using GOCR, this is a very simple way to deal with characters that are not in the ASCII or ISO-8859-1 tables, and it makes it possible to support any language.
Support in GOCR is very simple, as it should be. unicode.h contains a list of #defines for some of the characters. Note that only a small portion of the Unicode set is present there; it reflects what we already recognize and what we hope to be able to recognize in the near future. If you need to support other characters not found there, feel free to add them. Be sure to use their correct codes; you can get a full list of them in:
In short, GOCR uses UCS-4 encoding internally. This is much easier for the programmer to handle than UTF-8 encoding, and should not pose problems provided that you use the wcs* functions instead of the usual str* functions. The OutputFormatter module can be used to export UTF-8 text or whatever you need.
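To make the difference between the two encodings concrete, here is a minimal sketch of converting a wide (UCS-4) string to UTF-8 by hand. It is not GOCR's OutputFormatter code; the names ucs4_to_utf8 and wcs_to_utf8 are hypothetical, and the sketch assumes a 32-bit wchar_t. Note that the string length is taken with wcslen, not strlen.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of GOCR): encode one code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t ucs4_to_utf8(unsigned long cp, unsigned char out[4]) {
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0; /* surrogates are not valid characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: encode a NUL-terminated wide string into out,
 * which has room for cap bytes including the terminating NUL.
 * Returns the number of bytes written (excluding the NUL),
 * or (size_t)-1 if the buffer is too small or a character is invalid. */
size_t wcs_to_utf8(const wchar_t *src, char *out, size_t cap) {
    size_t n = wcslen(src); /* wide-character count, not bytes */
    size_t used = 0;
    size_t i;

    if (cap == 0)
        return (size_t)-1;
    for (i = 0; i < n; i++) {
        unsigned char buf[4];
        size_t len = ucs4_to_utf8((unsigned long)src[i], buf);
        if (len == 0 || used + len >= cap)
            return (size_t)-1;
        memcpy(out + used, buf, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

With UCS-4, every element of the array is one character, so indexing and counting are trivial; the variable-length bookkeeping only appears at the point where text is exported.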
The wchar_t type is used to handle wide characters. If needed, we assume that wchar_t is 32 bits long, which is the default on most Unix-like systems these days (it is 16 bits on Windows), but a 16-bit wchar_t may work if you don't use characters whose code is larger than 0xFFFF (65535).
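If you want to check this assumption on your own platform, something along these lines may help. This is only a sketch; wchar_bits and fits_in_wchar are hypothetical names, not GOCR functions.

```c
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper: width of wchar_t on this platform, in bits. */
int wchar_bits(void) {
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: returns 1 if the code point cp can be stored in
 * a single wchar_t on this platform, 0 otherwise. With a 32-bit wchar_t
 * the whole Unicode range (up to 0x10FFFF) fits; with a narrower one
 * only codes up to 0xFFFF do. */
int fits_in_wchar(unsigned long cp) {
    if (wchar_bits() >= 32)
        return cp <= 0x10FFFFUL;
    return cp <= 0xFFFFUL;
}
```

Calling fits_in_wchar before storing a code taken from an external table is one way to fail early on a 16-bit platform instead of silently truncating the value.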
GOCR provides a simple function that helps to compose characters and accents:
Modifier | Characters |
ACUTE_ACCENT | aeiouy AEIOUY |
CEDILLA | c C |
TILDE | ano ANO |
GRAVE_ACCENT | aeiou AEIOU |
DIAERESIS | aeiouy AEIOUY |
CIRCUMFLEX_ACCENT | aeiou AEIOU |
RING_ABOVE | a A |
e or E (æ, œ) | ao AO |
Besides that, it also supports a Latin-to-Greek character
translation if you pass 'g' as the modifier. See the table for
reference.
If main is a capital letter, the returned character will also be a capital letter. Support for Greek accents (tonos, dialytika, etc.) is under way.
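The idea behind the composition function can be illustrated with a small stand-alone sketch. This is not GOCR's implementation: sketch_compose and the SK_* constants are hypothetical names, only a few rows of the modifier table are covered, the two Greek mappings (a and b) are chosen purely for illustration, and returning the main character unchanged for an unsupported pair is an assumption of this sketch.

```c
#include <wchar.h>

/* Hypothetical modifier codes for this sketch only. The values are the
 * Unicode code points of the spacing accent characters; the names GOCR
 * really uses are the ones defined in unicode.h. */
#define SK_ACUTE_ACCENT 0x00B4
#define SK_CEDILLA      0x00B8
#define SK_RING_ABOVE   0x02DA

/* Combine a base character with a modifier and return the code of the
 * precomposed character. Unsupported pairs return main_ch unchanged
 * (an assumption of this sketch). */
wchar_t sketch_compose(wchar_t main_ch, wchar_t modifier) {
    switch (modifier) {
    case SK_ACUTE_ACCENT:
        switch (main_ch) {
        case L'a': return 0x00E1;
        case L'e': return 0x00E9;
        case L'i': return 0x00ED;
        case L'o': return 0x00F3;
        case L'u': return 0x00FA;
        case L'y': return 0x00FD;
        case L'A': return 0x00C1;
        case L'E': return 0x00C9;
        case L'I': return 0x00CD;
        case L'O': return 0x00D3;
        case L'U': return 0x00DA;
        case L'Y': return 0x00DD;
        }
        break;
    case SK_CEDILLA:
        switch (main_ch) {
        case L'c': return 0x00E7;
        case L'C': return 0x00C7;
        }
        break;
    case SK_RING_ABOVE:
        switch (main_ch) {
        case L'a': return 0x00E5;
        case L'A': return 0x00C5;
        }
        break;
    case L'g': /* Latin-to-Greek; capitals map to capitals */
        switch (main_ch) {
        case L'a': return 0x03B1; /* alpha */
        case L'A': return 0x0391;
        case L'b': return 0x03B2; /* beta */
        case L'B': return 0x0392;
        }
        break;
    }
    return main_ch;
}
```

For example, sketch_compose(L'e', SK_ACUTE_ACCENT) yields 0x00E9 (é) and sketch_compose(L'A', L'g') yields 0x0391 (capital alpha). A nested switch keeps each modifier's supported base characters in one place, which mirrors the layout of the table.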