6 Sep 2004 clmformat 1.004, 04-250
clmformat - display cluster results in readable form
(optionally with labels and/or cohesion and stickiness measures attached).
Unless used with the -dump option, clmformat depends on the presence of the macro processor zoem, as described further below.
clmformat -icl fname (input cluster file) -dump fname (write dump to file) [-tab fname (read tab file)]
The clustering in file fname should be in mcl matrix format - mcl generates clusters in this format. This will create a dump where each line contains a cluster in the form of tab-separated indices, or tab-separated labels in case the -tab option is used. This dump is easy to parse with a simple or even quick-and-dirty script.
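As an illustration of such a script, here is a minimal Python sketch. It assumes the default dump layout described above (one cluster per line, entries separated by tabs, no -dump-node-sep override). The function name parse_dump is hypothetical and not part of mcl.

```python
def parse_dump(text):
    """Return a list of clusters; each cluster is a list of entries.

    Assumption: one cluster per line, entries (indices, or labels when
    -tab was used) separated by tab characters.
    """
    clusters = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        clusters.append(line.split("\t"))
    return clusters


# Two clusters: {0, 1, 2} and {3, 4}.
sample = "0\t1\t2\n3\t4\n"
print(parse_dump(sample))
```

Entries are kept as strings, so the same function serves for both index dumps and label dumps.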
clmformat -icl fname (input cluster file) -imx fname (input matrix/graph file) [-tab fname (read tab file)] [-dir dirname (write results to directory)] [-zmm fname (assume macro definitions are in fname)] [-fmt fname (write to encoding file fname)]
clmformat -icl fname (input cluster file) -imx fname (input matrix/graph file) [-tab fname (read tab file)] [-dir dirname (write results to directory)] [-zmm fname (assume macro definitions are in fname)] [-fmt fname (write to encoding file fname)] [-h (synopsis)] [-dump fname (write dump to file)] [--dump-pairs (write cluster/node pair per line)] [-dump-node-sep str (separate entries with str)] [-pi num (apply pre-inflation to matrix)] [-infix str (use after base name/directory)] [-lump-count n (node threshold)] [-nsm fname (output node stickiness file)] [-ccm fname (output cluster cohesion file)] [--split (prepare output for csplit usage)] [--adapt (allow domain mismatch)] [--version (print version)] [--subgraph (take subgraph with --adapt)]
clmformat generates a logical description of the to-be-formatted content in a very small vocabulary of clmformat-specific zoem macros. The appearance of the output can be easily changed by adapting a zoem macro definition file (also output by clmformat) that is used by the zoem interpreter to interpret the logical elements.
The output format is apt to change over subsequent releases, as a result of user feedback. Such changes will most likely be confined to the zoem macro definition file.
The OUTPUT EXPLAINED section further below is likely to be of interest.
The primary function of clmformat is to display cluster results and associated confidence measures in a readable form, by listing clusters in terms of the labels associated with the indices that are used in the mcl matrix. The labels must be stored in a so-called tab file; see the -tab option for more information.
NOTE
clmformat output is in the form of zoem macros.
You need to have zoem installed in your system if you want clmformat
to be of use. Zoem will not be necessary if you are using
the -dump option.
The -imx mx option is required unless the -dump option is used. The latter option results in special behaviour described under the -dump fname entry.
Output is by default written to a directory, which is created if it does not yet exist (normally several files will be created, for which the directory acts as a natural container). To write to the current directory instead, specify -dir ./. If -dir is not specified, the output directory fmt.<clname> will be used, where <clname> is the argument to the -icl option. In the output directory, clmformat will normally write two files. The first (the encoding file) contains zoem macros encoding formatted output, and the second (the definition file) contains the zoem macro definitions used by the former.
The encoding file is by default called fmt.azm (cf. the -fmt fname option). It contains zoem macros. It imports the macro definition file called clmformat.zmm that is normally also written by clmformat. Another macro definition file can be specified by using the -zmm <defsname> option. In this case clmformat will refrain from writing the definition file and replace mentions of clmformat.zmm in the encoding file by <defsname>.
The encoding file needs to be processed by issuing one of the following commands from within the directory where the file is located.
zoem -i fmt -d html
zoem -i fmt -d txt
The first will result in HTML formatted output, the second in plain text format. Obviously, you need to have installed zoem (e.g. from http://micans.org/zoem/src/) for this to work.
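If the zoem step needs to be scripted, something along the following lines may serve. This is a minimal Python sketch, assuming zoem is installed and on the PATH. The helper names zoem_command and run_zoem are hypothetical, and only the -i and -d options already shown in this section are used.

```python
import subprocess


def zoem_command(device, base="fmt"):
    """Build the zoem invocation for one output device ('html' or 'txt')."""
    return ["zoem", "-i", base, "-d", device]


def run_zoem(directory, device):
    """Run zoem from within the directory that holds the encoding file."""
    return subprocess.run(zoem_command(device), cwd=directory, check=True)


# Show the commands without executing them.
for device in ("html", "txt"):
    print(" ".join(zoem_command(device)))
```

Calling run_zoem("fmt.myclustering", "html") would then process the encoding file in that directory; the directory name here is only an example.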
For each cluster a paragraph is output. First comes a listing of other clusters (in order of relevance, possibly empty) that share a significant number of edges with the current cluster. Second comes a listing of the nodes in the current cluster. For each node a small sublist is made (in order of relevance, possibly empty) of other clusters in which the node has neighbours and for which the total sum of corresponding edge weights is significant. Several quantities are output for each node/cluster pair that is deemed relevant. These are explained in the section OUTPUT EXPLAINED.
Clusters will by default be output to file until the total node count has exceeded a threshold (refer to the -lump-count option).
clmformat also shows how well each node fits in the cluster it is in and how cohesive each cluster is, using simple but effective measures (described in section OUTPUT EXPLAINED). This enables you to compare the quality of the clusters in a clustering relative to each other, and may help in identifying both interesting areas and areas for which cluster structure is hard to find or perhaps absent.
OUTPUT EXPLAINED
What follows is an explanation of the output provided by the standard zoem macros. The output comes in a terse, number-packed format. Headers and captions were deliberately left out of the output in order to keep it readable. You might want to print out the following annotated examples. By the same token, the following is probably tough reading unless you have an actual example of clmformat output at hand.
Below mention is made of the projection value for a node/cluster pair. This is simply the total amount of edge weights for that node in that cluster (corresponding to neighbours of the node in the cluster) relative to the overall amount of edge weights for that node (corresponding to all its neighbours). The coverage measure (referred to as cov) is also used. This is similar to the projection value, except that the coverage measure a) rewards the inclusion of large edge weights (and penalizes the inclusion of insignificant edge weights) and b) rewards node/cluster pairs for which the neighbour set of the node is very similar to the cluster. The maximum coverage measure (referred to as maxcov) is similar to the normal coverage measure except that it rewards inclusion of large edge weights even more. The cov and maxcov performance measures have several nice continuity and monotonicity properties and are described in [1].
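The projection value can be illustrated with a short Python sketch. It assumes the edges of one node are given as a mapping from neighbour index to edge weight; the function name projection and this data layout are hypothetical, chosen only for illustration. The cov and maxcov measures are not reproduced here, since their definitions are given in [1] rather than in this text.

```python
def projection(weights, cluster):
    """Projection value of one node relative to a cluster.

    weights: dict mapping neighbour index -> edge weight for the node.
    cluster: set of node indices making up the cluster.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0  # node without edges; convention chosen for this sketch
    inside = sum(w for nb, w in weights.items() if nb in cluster)
    return inside / total


# Neighbours 1 and 2 lie inside the cluster, neighbour 7 outside.
weights = {1: 2.0, 2: 1.0, 7: 1.0}
print(projection(weights, {1, 2, 3}))  # 0.75
```

Here the weights towards the cluster sum to 3.0 out of a total of 4.0, giving 0.75.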
Example cluster header
Cluster 0 sz 15 self 0.82 cov 0.43-0.26
   10: 0.11
   18: 0.05
   12: 0.02
explanation
Cluster 0 sz 15 self 0.82 cov 0.43-0.26
        |    |       |        |    |
        clid count   proj     cov  covmax

10: 0.11
|   |
clidx1 projx1

18: 0.05
|   |
clidx2 projx2

clid    Numeric cluster identifier (arbitrarily) assigned by MCL.
count   The size of cluster clid.
proj    Projection value for cluster clid [d].
cov     Coverage measure for cluster clid [d].
maxcov  Max-coverage measure for cluster clid [d].
clidx1  Index of other cluster sharing relatively many edges.
projx1  Projection value for the clid/clidx1 pair of clusters [e].
clidx2  :
projx2  :  as clidx1 and projx1
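A header line of this shape can be picked apart with a regular expression. The following Python sketch assumes plain-text output that matches the example above exactly; since the layout is determined by the zoem macro definitions and may change between releases, treat the pattern as an assumption. The names HEADER_RE and parse_cluster_header are hypothetical.

```python
import re

# Pattern for a line such as: Cluster 0 sz 15 self 0.82 cov 0.43-0.26
HEADER_RE = re.compile(
    r"^Cluster\s+(\d+)\s+sz\s+(\d+)\s+self\s+([\d.]+)"
    r"\s+cov\s+([\d.]+)-([\d.]+)$"
)


def parse_cluster_header(line):
    """Return the header fields as a dict, or None if the line differs."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    return {
        "clid": int(m.group(1)),
        "count": int(m.group(2)),
        "proj": float(m.group(3)),
        "cov": float(m.group(4)),
        "maxcov": float(m.group(5)),
    }


print(parse_cluster_header("Cluster 0 sz 15 self 0.82 cov 0.43-0.26"))
```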
Example inner node
An inner node is listed under a cluster, and it is simply a member of that
cluster. The name is as opposed to 'outer node', described below.
[foo bar zut] 21 7-5 0.73 0.420-0.331 0.282-0.047 0.071-0.035 <3.54>
   10 6/3 0.16 0.071-0.047 0.268-0.442
   12 4/2 0.11 0.071-0.035 0.296-0.515
explanation
[label] 21 7-5 0.73 0.420-0.331 0.282-0.047 0.071-0.035 <3.54>
        |  | | |    |     |     |     |     |     |     |
        idx nbi nbo proj cov covmax max_i min_i max_o-min_o SUM

10 6/3 0.16 0.268-0.442 0.071-0.047
|  | | |    |     |     |     |
clusid sz nb proj cov covmax max_i min_i

The labels under each row are listed in the same left-to-right order as the values they describe.

label   Optional; with -tab <tabfile> option.
idx     Numeric (mcl) identifier.
nbi     Count of the neighbours of node idx within its cluster.
nbo     Count of the neighbours of node idx outside its cluster.
proj    Projection value [a] of nbi edges.
cov     Skewed projection [b], rewards inclusion of large edge weights.
covmax  As cov above, rewarding large edge weights even more.
max_i   Largest edge weight in the nbi set, normalized [c].
min_i   Smallest edge weight in the nbi set [c].
max_o   Largest edge weight outside the nbi set [c].
min_o   Smallest edge weight outside the nbi set [c].
SUM     The sum of all edges leaving node idx.

clusid  Index of other cluster that is relevant for node idx.
sz      Size of that cluster.
nb      Count of neighbours of node idx in cluster clusid.
proj    Projection value of edges from node idx to cluster clusid.
cov     Skewed projection of edges from node idx to cluster clusid.
covmax  Maximally skewed projection, as above.
max_o   Largest edge weight for node idx to cluster clusid [c].
min_o   Smallest edge weight for node idx to cluster clusid [c].
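The main line of an inner node entry can be read in a similar way. The Python sketch below assumes the field order shown in the explanation above and an optional bracketed label in front; the names NODE_RE, INT_FIELDS and parse_inner_node are hypothetical, and the pattern only covers the main line, not the indented per-cluster lines that follow it.

```python
import re

# Pattern for a line such as:
# [foo bar zut] 21 7-5 0.73 0.420-0.331 0.282-0.047 0.071-0.035 <3.54>
NODE_RE = re.compile(
    r"^(?:\[(?P<label>[^\]]*)\]\s+)?"
    r"(?P<idx>\d+)\s+(?P<nbi>\d+)-(?P<nbo>\d+)\s+(?P<proj>[\d.]+)"
    r"\s+(?P<cov>[\d.]+)-(?P<covmax>[\d.]+)"
    r"\s+(?P<max_i>[\d.]+)-(?P<min_i>[\d.]+)"
    r"\s+(?P<max_o>[\d.]+)-(?P<min_o>[\d.]+)"
    r"\s+<(?P<SUM>[\d.]+)>$"
)
INT_FIELDS = ("idx", "nbi", "nbo")


def parse_inner_node(line):
    """Return the fields of an inner node line, or None if it differs."""
    m = NODE_RE.match(line.strip())
    if m is None:
        return None
    out = {}
    for name, raw in m.groupdict().items():
        if name == "label":
            out[name] = raw  # None when no label is present
        elif name in INT_FIELDS:
            out[name] = int(raw)
        else:
            out[name] = float(raw)
    return out


line = "[foo bar zut] 21 7-5 0.73 0.420-0.331 0.282-0.047 0.071-0.035 <3.54>"
node = parse_inner_node(line)
print(node["label"], node["idx"], node["nbi"], node["nbo"], node["SUM"])
```

Because the edge weights are written as fractions of SUM [c], an absolute weight could be recovered as, for example, node["max_i"] * node["SUM"].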
Example outer node
An outer node is listed under a cluster. The node is not part of that cluster,
but seems to have substantial connections to that cluster.
[zoo eek few] 29 18#2 2-5 0.65 0.883-0.815 0.436-0.218 0.073-0.055
   /4 0.27 0.070-0.109 0.073-0.055
explanation
[label] 29 18#2 2-5 0.65 0.883-0.815 0.436-0.218 0.073-0.055
        |  |  | | | |    |     |     |     |     |     |
        idx clid sz nbi nbo proj cov maxcov max_i min_i max_o min_o

/4 0.27 0.070-0.109 0.073-0.055 <2.29>
 | |    |     |     |     |     |
 nb proj cov maxcov max_i min_i SUM

The labels under each row are listed in the same left-to-right order as the values they describe.

label   Optional; with -tab <tabfile> option.
idx     Numeric (mcl) identifier.
clid    Index of the cluster that node idx belongs to.
sz      Size of the cluster that node idx belongs to.
proj    :
cov     :  All these entries are the same as described above
covmax  :  for inner nodes, pertaining to cluster clid,
max_i   :  i.e. the native cluster for node idx
min_i   :  (it is a member of that cluster).
max_o   :
min_o   :

nb      The count of neighbours of node idx in the current cluster.
proj    Projection value for node idx relative to current cluster.
cov     Skewed projection (rewards large edge weights), as above.
covmax  Maximally skewed projection, as above.
max_o   Largest edge weight for node idx in current cluster [c].
min_o   Smallest edge weight for node idx in current cluster [c].
SUM     The sum of *all* edges leaving node idx.
[a] The projection value for a node relative to some subset of
    its neighbours is the sum of edge weights of all edges to that
    subset. The sum is written as a fraction relative to the sum
    of edge weights of all neighbours.
[b] cov and covmax stand for coverage and maximal coverage.
    The coverage measure of a node/cluster pair is a generalized and skewed
    projection value [a] that rewards the presence of large edge weights in the
    cluster, relative to the collection of weights of all edges departing from
    the node. The maxcov measure is a projection value skewed even further,
    correspondingly rewarding the inclusion of large edge weights. The cov and
    maxcov performance measures have several nice continuity properties and are
    described in [1].
[c] All edge weights are written as the fraction of the sum
    SUM of all edge weights of edges leaving node idx.
[d] For clusters the projection value and the coverage measures
    are simply the averages of all projection values [a], respectively
    coverage measures [b], taken over all nodes in the cluster.
    The cluster projection value simply measures the sum of edge
    weights internal to the cluster, relative to the total sum of
    edge weights of all edges where at least one node in the edge
    is part of the cluster.
[e] The projection value for start cluster x and end cluster y
    is the sum of edge weights of edges between x and y as a fraction
    of the sum of all edge weights of edges leaving x.
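The cluster-to-cluster projection of note [e] can be sketched in Python as follows. The sketch makes two assumptions that are not stated in this text: the graph is stored as a dict of dicts (node -> neighbour -> weight), and "edges leaving x" is read as every edge departing from a node of x, including edges that stay inside x. The function name cluster_projection is hypothetical.

```python
def cluster_projection(graph, x, y):
    """Projection of start cluster x onto end cluster y.

    graph: dict mapping node -> dict mapping neighbour -> edge weight.
    x, y:  sets of node indices.
    Assumption: the denominator counts all edges departing from nodes
    in x, whether they end inside or outside x.
    """
    between = 0.0
    total = 0.0
    for node in x:
        for nb, w in graph.get(node, {}).items():
            total += w
            if nb in y:
                between += w
    return between / total if total else 0.0


graph = {
    0: {1: 3.0, 2: 1.0},
    1: {0: 3.0, 3: 1.0},
    2: {0: 1.0, 3: 2.0},
    3: {1: 1.0, 2: 2.0},
}
print(cluster_projection(graph, {0, 1}, {2, 3}))  # 0.25
```

In this toy graph the nodes of cluster {0, 1} have a total edge weight of 8.0, of which 2.0 goes to cluster {2, 3}.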
[1] Stijn van Dongen. Performance criteria for graph clustering and Markov
    cluster experiments. Technical Report INS-R0012, National Research
    Institute for Mathematics and Computer Science in the Netherlands,
    Amsterdam, May 2000.
    http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z
See mclfamily for an overview of all the documentation and the utilities in the mcl family.