HTML Scanning

This section explains how Globalyzer scans HTML source and shows you how to customize the Included HTML Tags and the Ignored HTML Tags within an HTML Rule Set. Tag configuration tells Globalyzer how to detect display text within your HTML source, whether the HTML sits in a static HTML file or in a jsp, asp, or aspx file.

Scanning HTML Code
Configuring Included HTML Tags
Configuring Ignored HTML Tags

Scanning HTML Code

You can configure a Globalyzer HTML Rule Set to scan HTML code that exists within any web file. These include:

Static HTML files (.htm or .html)
Java Server Pages (.jsp)
Active Server Pages (.asp)
ASP.NET (.aspx)
Any other file type that contains chunks of HTML code can be specified by adding it to the Rule Set's File Extensions list.

Even if the file type you are scanning is one of those that Globalyzer has been tested to support, you will probably get better results if you customize the HTML Rule Set to fit your application-specific HTML coding style and any non-standard tags. Before you begin customizing a Rule Set, it will help you to understand how Globalyzer scans HTML code. More specifically, you should understand how Globalyzer detects display strings in static HTML blocks.

Scanning HTML Blocks for Display Text

You will most likely scan blocks of HTML code to detect, and perhaps externalize to resource files, the display strings embedded in that code. You might also want to find the paths to image files. This needs no special tweaking: just make sure all of the image file types you use in your application are included in the Static File References category in your Rule Set.

If you are scanning static HTML files, you might simply need to quantify the number of strings that will require translation within a body of static HTML. You might be considering conversion of the static HTML pages to jsp, asp, or aspx files so you can place the display text in resource files and dynamically retrieve the strings at runtime from those files according to the locale of the user.

You may be scanning jsp, asp, or aspx files that contain blocks of HTML with embedded display text. Regardless of your purpose and of the file type you are scanning, an HTML Rule Set is concerned only with the static blocks of HTML code within your source files. If your files include other in-line code that also contains display text, you will need to create a separate Rule Set for the appropriate programming language and scan the file separately with that Rule Set to detect the embedded strings.

In other words, let's say you have a .jsp file that includes a combination of server-side Java code, static HTML code, and JavaScript - all containing display text. To scan - and later externalize - all of the display text embedded in this code sample, you will need to create a Java Rule Set, a JavaScript Rule Set and an HTML Rule Set. Then you will scan your source code with each Rule Set you created. The reason for this is that each programming language is governed by a separate set of rules dictating what defines a "string" or a piece of text that may be displayed to a user.

When you scan the above sample of code for embedded strings using a default HTML Rule Set (no customization) you will see the following in the Embedded Strings Scan Results:

How does Globalyzer determine which of the strings embedded in the HTML blocks are displayed in the browser? The basics of Globalyzer's HTML string detection can be summed up in four short rules:

1. Globalyzer looks for text embedded between an opening and a closing tag that match. In the example code, this includes the opening and closing td tags containing the text Occupation:

Note: Globalyzer's tag matching is strict and case sensitive. If in the above example the td in the opening tag were lower case and the TD in the closing tag were upper case, Globalyzer would not detect the text between them.

2. There can be no other tags between the matching opening and closing tags. This is why Globalyzer doesn't just grab everything between the form tags or everything between the body tags.

3. As an amendment to the previous rule, Globalyzer is by default configured to "include" certain style tags and in-line code tags. This is why, despite rule 2, Globalyzer did detect the Welcome back text and left the bold tags and the in-line Java code with it.

4. Globalyzer is by default configured to "ignore" certain tags, such as code. Globalyzer will not find text between ignored tags; it assumes that such text is not displayed to an end user and so does not need to be detected.
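The first two rules can be approximated with a regular expression. The following is a minimal Java sketch, not Globalyzer's actual implementation: the class name TagPairSketch, the detect method, and the pattern itself are assumptions made for illustration, and the sketch deliberately leaves out the included and ignored tag lists covered by the third and fourth rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the first two rules; not Globalyzer code.
class TagPairSketch {
    // An opening tag, text that contains no other tag, and a closing tag with
    // the same name. The backreference \1 is case sensitive, so <td>...</TD>
    // does not count as a matching pair.
    private static final Pattern INNERMOST_PAIR =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>([^<>]+)</\\1>");

    // Returns the text found between each innermost matching tag pair.
    static List<String> detect(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = INNERMOST_PAIR.matcher(html);
        while (m.find()) {
            String text = m.group(2).trim();
            if (!text.isEmpty()) {
                found.add(text);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Matching lower-case pair: the text is detected.
        System.out.println(detect("<tr><td>Occupation:</td></tr>")); // [Occupation:]
        // Mismatched case in the closing tag: nothing is detected.
        System.out.println(detect("<tr><td>Occupation:</TD></tr>")); // []
    }
}
```

Because the text group excludes angle brackets, an outer element such as tr or form never matches as a whole; only the innermost pairs do.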

Configuring Included HTML Tags

In the previous section we explained that Globalyzer looks for the inner-most matching tags in the static HTML and grabs any text in between those. The exception to this rule lies in the fact that Globalyzer is configured to include certain tags, and hence even a default HTML Rule Set will grab the following text as a single embedded text block:

That Globalyzer grabs this entire block of text at once is important for several reasons. First, the entire block should be externalized intact because the translator needs to be able to combine the words in the order that suits the user's locale. Granted, the translator won't literally have the user's name - as this is a variable populated at runtime. In this case, the person internationalizing the code will need to do more than simply externalize this piece of display text.

For the other two strings detected in the sample HTML, the user could externalize the strings to a resource file as they are. With this string, the developer needs to do a little refactoring before the string can be externalized. This includes using Java's MessageFormat class, which allows the translator not only to translate the translatable elements of this phrase, but also to specify the order in which they are displayed on the page. Similar classes exist on other platforms, such as Microsoft's .NET.
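The refactoring described above might look like the following sketch. It uses java.text.MessageFormat from the standard Java library; the class name WelcomeMessageSketch, the pattern wording, and the user name are assumptions made for illustration. In a real application the patterns would typically be loaded from a locale-specific resource file rather than written in the code.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical illustration; the pattern strings stand in for resource-file entries.
class WelcomeMessageSketch {
    // {0} marks where the runtime value is inserted. The translator is free
    // to move the placeholder anywhere within the pattern.
    static String format(String pattern, Locale locale, String userName) {
        return new MessageFormat(pattern, locale).format(new Object[] {userName});
    }

    public static void main(String[] args) {
        // Source-language pattern.
        String original = "Welcome back, <b>{0}</b>!";
        // Stand-in for a translation whose word order puts the name first.
        String reordered = "<b>{0}</b>, welcome back!";

        System.out.println(format(original, Locale.ENGLISH, "Pat"));  // Welcome back, <b>Pat</b>!
        System.out.println(format(reordered, Locale.ENGLISH, "Pat")); // <b>Pat</b>, welcome back!
    }
}
```

Note that MessageFormat treats the apostrophe as a quoting character, so a pattern that needs a literal apostrophe must double it.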

Let's now focus on the fact that Globalyzer included the bold tags and the in-line Java in line 28 of our sample .jsp file. Globalyzer will automatically include style tags such as bold tags and italics tags. During the Rule Set creation and editing processes, you can view, edit, delete from and add to the list of tags that Globalyzer will include.

By included, we mean that Globalyzer sees these tags as plain text. The default settings in this category could be sufficient. But if you are using a third-party tag library that includes non-standard tags, some of which you want Globalyzer to include, these can be added to the list.

On the other hand, you may consistently delineate cohesive blocks of text with a tag that Globalyzer is by default configured to include - such as the font tag. In this case, you would simply uncheck that tag in the list of included tags, and Globalyzer would begin to detect all text between font tags.
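To make the effect of the included list concrete, here is a small Java sketch under stated assumptions. It is not how Globalyzer is implemented: the class IncludedTagsSketch, the masking approach, and the sample markup are all hypothetical. Included tags are temporarily disguised as plain text, so they no longer break up the surrounding block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "included" tags being seen as plain text.
class IncludedTagsSketch {
    // Private-use characters that stand in for '<' and '>' while a tag is masked.
    private static final char OPEN = '\uE000';
    private static final char CLOSE = '\uE001';

    // Innermost matching tag pair with no other tag in between (case sensitive).
    private static final Pattern INNERMOST_PAIR =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>([^<>]+)</\\1>");

    static List<String> detect(String html, Set<String> includedTags) {
        String masked = html;
        for (String tag : includedTags) {
            // Mask both <tag ...> and </tag> so they no longer look like tags.
            Pattern p = Pattern.compile("<(/?" + Pattern.quote(tag) + "\\b[^<>]*)>");
            masked = p.matcher(masked).replaceAll(OPEN + "$1" + CLOSE);
        }
        List<String> found = new ArrayList<>();
        Matcher m = INNERMOST_PAIR.matcher(masked);
        while (m.find()) {
            // Restore the masked tags inside the detected text.
            String text = m.group(2).replace(OPEN, '<').replace(CLOSE, '>').trim();
            if (!text.isEmpty()) {
                found.add(text);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p><font color=\"red\">Sale ends Friday.</font> "
                + "<font color=\"blue\">Free shipping.</font></p>";
        // font included: one block that keeps the font tags.
        System.out.println(detect(html, Set.of("font")));
        // font not included: one string per font element.
        System.out.println(detect(html, Set.of())); // [Sale ends Friday., Free shipping.]
    }
}
```

With font on the included list, the whole paragraph content comes back as a single block with its font tags intact; with font removed from the list, each font element yields its own string.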

Configuring Ignored HTML Tags

Globalyzer also has a list of Ignored HTML Tags. Ignored tags and all text in between are removed prior to scanning. The default settings should be sufficient, but you can view, edit, delete from and add to the list of tags that Globalyzer will ignore.

Note: If a tag is configured to be on both the Included list and the Ignored list, it will be ignored.
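The two behaviors described in this section, removing ignored elements before the scan and letting the Ignored list win over the Included list, can be sketched in Java as follows. The class IgnoredTagsSketch and both method names are hypothetical, and the regular expression is a simplification that does not handle an ignored element nested inside another element of the same name.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration; not Globalyzer code.
class IgnoredTagsSketch {
    // Removes each ignored element, including all text between its tags.
    static String stripIgnored(String html, Set<String> ignoredTags) {
        String result = html;
        for (String tag : ignoredTags) {
            String name = Pattern.quote(tag);
            Pattern element = Pattern.compile(
                    "<" + name + "\\b[^<>]*>.*?</" + name + ">", Pattern.DOTALL);
            result = element.matcher(result).replaceAll("");
        }
        return result;
    }

    // A tag that appears on both lists is treated as ignored.
    static Set<String> effectiveIncluded(Set<String> included, Set<String> ignored) {
        Set<String> result = new LinkedHashSet<>(included);
        result.removeAll(ignored);
        return result;
    }

    public static void main(String[] args) {
        String html = "<td>Total:</td><code>x = 1;</code>";
        System.out.println(stripIgnored(html, Set.of("code")));                  // <td>Total:</td>
        System.out.println(effectiveIncluded(Set.of("b", "code"), Set.of("code"))); // [b]
    }
}
```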
