This module parses and dumps documents that are formatted more or
less according to RFC 4180, "Common Format and MIME Type for
Comma-Separated Values (CSV) Files",
http://www.rfc-editor.org/rfc/rfc4180.txt.

There are some issues with this RFC. I will describe what these
issues are and how I deal with them.

First, the RFC prescribes CRLF standard network line breaks, but
you are likely to run across CSV files with other line endings, so
I accept any sequence of CRs and LFs as a line break.
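
The line-break rule above can be sketched as follows. This is only an illustration, not this module's code: `splitOnBreaks` is a hypothetical helper, and it deliberately ignores quoted fields, which may legitimately contain CRs and LFs.

```haskell
-- Hypothetical helper, not part of this module's API.
-- Treats any run of CR and LF characters as one line break.
-- Assumption: no quoted fields, so every CR or LF ends a record.
splitOnBreaks :: String -> [String]
splitOnBreaks s =
  case dropWhile isBreak s of
    ""   -> []
    rest -> let (line, rest') = break isBreak rest
            in  line : splitOnBreaks rest'
  where
    isBreak c = c == '\r' || c == '\n'

main :: IO ()
main = print (splitOnBreaks "a,b\r\nc,d\n\re,f\r")
-- prints ["a,b","c,d","e,f"]
```

Because a whole run of CRs and LFs counts as a single break, CRLF, LF-only and CR-only files all split the same way, and blank lines do not yield empty records in this sketch.
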

Second, there is an optional header line, but the format of the
header line is exactly like that of a regular record, and you can
only figure out whether it exists from the MIME type, which may not
be available. I ignore the issue of header lines and simply treat
them as regular records.
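
Since a header line comes back as an ordinary first record, a caller who knows that a header is present can split it off afterwards. The sketch below assumes that a parsed document is a list of records, each a list of string fields. The type synonyms are declared locally for the example, and `splitHeader` and `labelRecords` are hypothetical names.

```haskell
-- Local synonyms for this sketch (assumed shape of a parsed document).
type Field  = String
type Record = [Field]
type CSV    = [Record]

-- Hypothetical helper: split off the first record as the header.
-- Returns Nothing for an empty document.
splitHeader :: CSV -> Maybe (Record, [Record])
splitHeader []       = Nothing
splitHeader (h : rs) = Just (h, rs)

-- Hypothetical helper: pair every field with its column name.
-- zip truncates, so surplus fields or surplus names are dropped.
labelRecords :: Record -> [Record] -> [[(Field, Field)]]
labelRecords header = map (zip header)

main :: IO ()
main =
  case splitHeader [["name", "age"], ["Ann", "31"], ["Bob", "27"]] of
    Nothing           -> putStrLn "empty document"
    Just (hdr, body)  -> mapM_ print (labelRecords hdr body)
```

Whether the first record really is a header remains the caller's decision, which matches the point made above that the file alone cannot tell you.
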

Third, there is an inconsistency: the formal grammar specifies that
fields can contain only certain US-ASCII characters, but the
specification of the MIME type allows for other character sets. I
allow all characters in fields, except for commas, CRs and LFs in
unquoted fields. This should make it possible to parse CSV files in
any encoding, but it also admits characters such as tabs that the
RFC may be interpreted to forbid even in non-US-ASCII character
sets.
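
To make the difference concrete, the sketch below compares the relaxed rule described above with the `TEXTDATA` rule from the RFC's grammar (`%x20-21 / %x23-2B / %x2D-7E`). Both predicates are hypothetical names written for this illustration, and the relaxed one follows the sentence above literally rather than reproducing this module's parser.

```haskell
-- Hypothetical predicate: the relaxed rule for unquoted fields as
-- described in the text (everything except comma, CR and LF).
isUnquotedFieldChar :: Char -> Bool
isUnquotedFieldChar c = c `notElem` ",\r\n"

-- Hypothetical predicate: TEXTDATA from the RFC 4180 grammar,
-- %x20-21 / %x23-2B / %x2D-7E.
isRfcTextData :: Char -> Bool
isRfcTextData c = any inside [(0x20, 0x21), (0x23, 0x2B), (0x2D, 0x7E)]
  where
    n = fromEnum c
    inside (lo, hi) = lo <= n && n <= hi

-- Show how both rules classify one character.
report :: Char -> IO ()
report c =
  putStrLn (show c ++ ": relaxed=" ++ show (isUnquotedFieldChar c)
                   ++ ", rfc=" ++ show (isRfcTextData c))

main :: IO ()
main = mapM_ report "a\t\233,"
```

A tab and the non-ASCII character `'\233'` pass the relaxed rule but fail the RFC rule, while a comma fails both.
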

NOTE: Several people have asked me to implement extensions that are
used in non-US versions of Microsoft Excel. This library implements
RFC-compliant CSV, not Microsoft Excel CSV. If you want a library
that deals with the CSV-like formats used by non-US versions of
Excel or any other software, you should write a separate library. I
suggest you call it Text.SSV, for Something Separated Values.