Thereafter, using the transformation language XSLT, information was to be extracted from the XML document and presented in different ways to depict different scenarios. Oxygen XML Editor was used for the project.
3. Discussion
A total of 7 webpages were taken from the Internet for the project. This section presents the reasons that motivated the various decisions that were taken during each stage of the project.
4. The Material Chosen
The site is an informational site that lists companies in different application areas where jobs can be sought. All the material chosen for the project was taken from this one site, so there is a hierarchy among the different web pages. Furthermore, the content on each page also follows a hierarchical structure that can be translated into relations, which facilitates the markup. The selected webpages covered the most commonly used elements of informational sites (e.g. headings, text, links, paragraphs and lists). This presented an opportunity to learn how to encode the different elements in XML. Besides the hierarchical structure across the selected pages, a repeating structure is also present within each page, which further facilitates the marking-up process.
5. The Document Analysis
The first task was to analyze the documents and identify the manner and relation in which the data was presented in them. It was found that one page presented a list of companies categorized into their respective application areas, while the remaining 6 pages presented details of 6 of these companies. The relationship between the 7 documents was therefore identified as shown in Figure 1.
Figure 1 Tree Structure of Pages
Within each of these pages, a pattern was found in the way the information was presented. Within the home page, there were categories and a list of companies in each category.
Figure 2 Structure of Home Page
In the remaining 6 pages, information about a company was presented under related headings that exhibited a pattern. Some headings were common to all 6 companies.
Figure 3 Common Structure of About Pages
Thus, the information from all 7 pages was unified, and a tree structure was formed that represented how portions of information were related to one another through the relationships of root, parent, child and sibling.
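As a rough illustration of the root, parent, child and sibling relationships described above, the following sketch builds a small tree with Python's standard xml.etree.ElementTree module. The element names used here (guide, category, company, name, about) and the sample values are hypothetical; the actual vocabulary of the project's XML is not given in this report.

```python
import xml.etree.ElementTree as ET


def build_guide():
    """Build a tiny example tree; all element names are assumed, not the project's."""
    root = ET.Element("guide")  # root of the unified tree
    # A category is a child of the root and the parent of its companies.
    category = ET.SubElement(root, "category", name="Example Area")
    for company_name in ("Company A", "Company B"):  # sibling company elements
        company = ET.SubElement(category, "company")
        ET.SubElement(company, "name").text = company_name  # leaf element
        ET.SubElement(company, "about").text = "Placeholder description"  # leaf element
    return root


def leaves(element):
    """Return, in document order, the elements that have no child elements."""
    return [e for e in element.iter() if len(e) == 0]


guide = build_guide()
print(ET.tostring(guide, encoding="unicode"))
print([leaf.tag for leaf in leaves(guide)])
```

In this sketch the two company elements are siblings sharing one category parent, and the name and about elements are the leaves that would hold the data once a skeleton is filled in.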
Figure 4 Unified Tree Structure
Once the document tree was identified, the document was marked up accordingly into an XML file (guide.xml), in which the leaves of the tree were represented as elements with no children. Figure 5 shows the tree structure of the resulting XML.
Figure 5 XML Tree Structure
Once the skeleton XML was formed, the data was filled in.
6. Encoding Scheme
The next task was to validate the XML against an encoding scheme. This is important because the scheme defines the structural rules that all input must adhere to. Any entry that does not conform to the scheme makes the XML invalid, even though the document may still be well-formed. Two options were available for validating the XML document: XML DTD or XML Schema. Both are standardized (so developers can understand them equally easily) and both deliver the same functionality, yet they differ in their definition. DTD has the lowest definition of data as CDATA