1) Basic Rules:
a) Any traditional character (letters, numbers, and symbols) counts as a single VISIBLE CHARACTER
b) A space or new-line counts as a single NON-VISIBLE CHARACTER
c) An HTML tag counts as a single NON-VISIBLE CHARACTER (rule 5 gives the details and exceptions)
d) A traditional character count is the sum of VISIBLE and NON-VISIBLE characters (when we simply say "character count" we usually mean this total sum)
e) A "visible character count" is the sum of VISIBLE characters
f) A "65-character" or "AAMT" line count simply takes the total number of characters in the entire document and divides by 65 (rounded up) to count total lines
g) A "line count" usually implies a "Basic Line Count", which divides the number of characters on each line by 65 (rounded up) and then sums the results across all lines
h) A "template character count" is simply the number of VISIBLE and NON-VISIBLE characters that appear inside template sections
i) A "template visible character count" is the number of VISIBLE characters that appear inside template sections
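The difference between rules 1f and 1g is easiest to see with numbers. The following is a minimal Python sketch, not Emdat's actual implementation: the function names are hypothetical, the text is assumed to be tag-free and already normalized, and the per-line character totals are assumed to be supplied by the caller.

```python
import math

LINE_WIDTH = 65  # characters per line, from rules 1f and 1g


def character_counts(text: str) -> tuple[int, int]:
    """Return (visible, non_visible) for tag-free text (rules 1a and 1b)."""
    non_visible = sum(1 for ch in text if ch in " \n")
    visible = len(text) - non_visible
    return visible, non_visible


def aamt_line_count(total_chars: int) -> int:
    """Rule 1f: whole-document total divided by 65, rounded up."""
    return math.ceil(total_chars / LINE_WIDTH)


def basic_line_count(chars_per_line: list[int]) -> int:
    """Rule 1g: round each line up separately, then sum."""
    return sum(math.ceil(n / LINE_WIDTH) for n in chars_per_line)
```

For three lines of 70, 10, and 130 characters (210 in total), aamt_line_count(210) gives 4 while basic_line_count([70, 10, 130]) gives 2 + 1 + 2 = 5, because each short or overflowing line is rounded up on its own.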
2) All consecutive whitespace (tabs, spaces, new lines) is combined into a single character, even if intentionally inserted by the transcriptionist
a) If there is a new line, the consecutive whitespace is reduced to a single new-line
b) Otherwise it is reduced to a single space
3) Whitespace at the beginning and end of the document is insignificant and not counted
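Rules 2 and 3 can be sketched in a few lines of Python. This is an illustration under assumptions, not the production code: the function name is hypothetical, and a run of whitespace is taken to "contain a new line" when it includes a "\n" character.

```python
import re


def normalize_whitespace(text: str) -> str:
    # Rule 3: whitespace at the start and end of the document is not counted.
    text = text.strip()
    # Rule 2: collapse every run of whitespace into one character,
    # a new-line if the run contains one (2a), otherwise a space (2b).
    return re.sub(r"\s+", lambda m: "\n" if "\n" in m.group() else " ", text)
```

For example, the input "  Hello \t world \n\n  next  " becomes "Hello world\nnext": 14 visible characters and 2 non-visible ones.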
4) All HTML tags are REMOVED from the document, except ones that Emdat considers significant:
a) Formatting Tags: Bold, Italic, Underline, Font Changes, etc.
b) Structural Tags: Line-Breaks, Paragraphs, Tables, Lists, Page-Break, Horizontal Rule
c) Special Markers: e.g. Template Sections
5) Each remaining HTML tag is counted as a NON-VISIBLE CHARACTER
a) Any formatting tag counts as TWO characters, one for enabling the formatting and one for disabling it (i.e. BOLD is always followed by an UNBOLD)
b) Some tags only count as ONE character: page/line breaks, horizontal rule
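As a rough illustration of rule 5, the sketch below tallies NON-VISIBLE characters contributed by tags, using Python's standard html.parser module. The tag sets are assumptions chosen for the example (the real list of significant tags is Emdat's), the class and function names are hypothetical, and tags whose weight the rules above do not spell out (paragraphs, tables, lists) are simply left uncounted here.

```python
from html.parser import HTMLParser

# Assumed mappings for illustration only.
FORMATTING_TAGS = {"b", "i", "u", "font"}  # rule 5a: two characters each
SINGLE_TAGS = {"br", "hr"}                 # rule 5b: one character each


class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.non_visible = 0

    def handle_starttag(self, tag, attrs):
        if tag in FORMATTING_TAGS:
            self.non_visible += 2  # one for enabling, one for disabling
        elif tag in SINGLE_TAGS:
            self.non_visible += 1
        # Any other tag is treated as removed (rule 4) in this sketch.


def count_tag_characters(html: str) -> int:
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.non_visible
```

With these assumptions, "<b>Hi</b><br><span>x</span><hr/>" contributes 4 non-visible characters: 2 for the bold pair, 1 for the line break, 0 for the removed span, and 1 for the horizontal rule.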
6) Newlines within TABLES are INSIGNIFICANT:
a) Therefore they DO NOT increase the number of Basic Lines in a transcription
b) They are treated as a single space for the purposes of counting
c) Imagine that we count tables by copying text from all the cells and putting it on a single line
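The "copy every cell onto a single line" picture in rule 6c might look like this in Python. It is only a sketch: the table is assumed to arrive as a list of rows of cell strings, the function name is hypothetical, and joining cells with a single space is an assumption about how adjacent cells are separated.

```python
def flatten_table(rows: list[list[str]]) -> str:
    # Rule 6b: new-lines (and other whitespace runs) inside a cell
    # become a single space.
    cells = [" ".join(cell.split()) for row in rows for cell in row]
    # Rule 6c: put the text of every non-empty cell on one line.
    return " ".join(cell for cell in cells if cell)
```

A table such as [["Name", "Dose\nmg"], ["Aspirin", "81"]] flattens to "Name Dose mg Aspirin 81", which is a single Basic Line no matter how many rows the table has on screen.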
And two rules that no one except a developer should care about, but which are included here for completeness...
7) Empty HTML tags are eliminated from the transcription
- Suppose the transcriptionist deletes a bolded word; the editor sometimes leaves the BOLD/UNBOLD commands back to back in the document. This rule says that if there is no VISIBLE text to format, the formatting tags are removed so that they are not counted. Likewise, if a TABLE has no content in any of its cells, the entire table is NOT COUNTED.
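For developers, rule 7 can be approximated with a regular expression applied repeatedly until nothing changes. This is a simplified sketch: it only handles attribute-free b, i, and u tags (an assumed subset), the names are hypothetical, and the empty-table case is not covered.

```python
import re

# Matches an opening tag immediately closed, with at most whitespace between.
EMPTY_PAIR = re.compile(r"<(b|i|u)>(\s*)</\1>", re.IGNORECASE)


def remove_empty_formatting(html: str) -> str:
    previous = None
    # Repeat so that nested empties such as <b><i></i></b> are fully removed.
    while previous != html:
        previous = html
        # Keep any whitespace that was inside; only the tags are dropped.
        html = EMPTY_PAIR.sub(r"\2", html)
    return html
```

Here "Take <b></b>daily" becomes "Take daily", so the leftover bold pair no longer adds two non-visible characters to the count.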
8) Whitespace is "pushed out" of HTML tags
- Suppose the MT hits space, turns on BOLD, hits space again, then types a word.
- Since a non-visible character sits directly inside the BOLD formatting, we "push it out" next to the other space so that BOLD is always followed by a visible character.
- This matters because the two non-visible characters are now consecutive and can be collapsed into a single space for counting.
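A minimal sketch of the "push out" step in Python follows. It assumes attribute-free b, i, and u tags and uses hypothetical names. Moving trailing whitespace out past a closing tag is an assumption made here by symmetry; the rule above only describes the leading case.

```python
import re

# Whitespace just after an opening tag, or just before a closing tag.
LEADING = re.compile(r"(<(?:b|i|u)>)(\s+)", re.IGNORECASE)
TRAILING = re.compile(r"(\s+)(</(?:b|i|u)>)", re.IGNORECASE)


def push_out_whitespace(html: str) -> str:
    previous = None
    # Repeat so whitespace can travel out through nested tags.
    while previous != html:
        previous = html
        html = LEADING.sub(r"\2\1", html)   # "<b> x"  -> " <b>x"
        html = TRAILING.sub(r"\2\1", html)  # "x </b>" -> "x</b> "
    return html
```

Given "dose <b> daily</b>", the result is "dose  <b>daily</b>": the two spaces now sit side by side, so rule 2 can collapse them into one.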
