Option C: Web science


C.1 Creating the Web

Command Term | Level | Definition
Describe | 2 | Give a detailed account.
Distinguish | 2 | Make clear the differences between two or more concepts or items.
Outline | 2 | Give a brief account or summary.
Evaluate | 3 | Make an appraisal by weighing up the strengths and limitations.
Explain | 3 | Give a detailed account including reasons or causes.
Identify | 2 | Provide an answer from a number of possibilities.

C.1.1Distinguish between the internet and World Wide Web (web).
• The internet is the infrastructure which enables computers, servers and other devices to establish communication by means of cables and satellite connections.
• The World Wide Web uses the internet to access data and enable data exchange between users all over the globe; applications of the web include web pages and web-based email.
C.1.2Describe how the web is constantly evolving.
• The web began as a platform for data exchange among a limited number of users, with applications such as online libraries at universities.
• Commercial applications, such as online shopping, were then added to the web.
• With Web 2.0, users' demand for social features was met by social platforms such as Facebook and Myspace, and work began on the semantic web (which helps computers understand the meaning behind web pages and the interaction between the computer and its users).
• Due to the progress of technology and the availability of high-speed internet, mobile devices and connected things such as fridges, houses and cars will play a bigger role.
C.1.3Identify the characteristics of the following:
  • hypertext transfer protocol (HTTP)
    hypertext transfer protocol (HTTP): is a protocol that describes the data exchange on the World Wide Web (which port to use and how the data should be formatted).
    Port 80 is the standard port for HTTP, though other ports can be used.
  • hypertext transfer protocol secure (HTTPS)
    hypertext transfer protocol secure (HTTPS): is the same as HTTP but extended with a security component that encrypts the data exchange between sender and receiver.
    Port 443 is the standard port for HTTPS, though other ports can be used.
  • hypertext mark-up language (HTML)
    hypertext mark-up language (HTML): is the standard for formatting content that is to be displayed in web browsers.
  • uniform resource locator (URL)
    Uniform Resource Locator (URL): is the address of a web page, usually chosen to be easy to remember. It consists of at least a second-level domain such as "facebook" and a top-level domain such as .com or .de.
  • extensible mark-up language (XML)
    Extensible mark-up language (XML) : is a tag-based syntax which is used to structure and describe information
  • extensible stylesheet language transformations (XSLT)
    Extensible stylesheet language transformations (XSLT): is a language which transforms XML documents into different output formats required by browsers such as Google Chrome and Internet Explorer.
  • JavaScript.
    JavaScript: is a programming language commonly used in web applications.
    We used this during the first year of IBCS to validate form elements, among other things (a small validation sketch follows this list).
  • cascading style sheet (CSS).
    cascading style sheet (CSS): is the central source of formatting instructions for the content and layout of a web page.
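
As a concrete tie-back to the year-1 work mentioned above, here is a minimal client-side validation sketch in JavaScript; the "email" field id and the message are made-up examples, not taken from our actual lesson files.

```javascript
// Minimal client-side validation sketch; the "email" id is a hypothetical example.
// Used as a form's onsubmit handler, it blocks submission of an empty field
// without any round trip to the server.
function validateForm() {
  var email = document.getElementById("email").value;
  if (email.trim() === "") {
    alert("Please enter an email address.");
    return false;   // returning false cancels the form submission
  }
  return true;
}
```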
Activities Covering the Content
hypertext transfer protocol (HTTP)
hypertext transfer protocol secure (HTTPS)
Outline the principal difference between HTML and HTTP. HTML is a programming/scripting/markup language;
HTTP is a protocol/standard;
hypertext mark-up language (HTML)
uniform resource locator (URL)
extensible mark-up language (XML)
Identify one characteristic of XML.

It does not contain a fixed set of tags, therefore new ones can be added; [1 mark]

extensible stylesheet language transformations (XSLT)  
JavaScript (see year 1 JavaScript coverage)
cascading style sheet (CSS)
  • See agenda items for Wednesday November 9, 2016, which include examples and lesson instructions for creating HTML pages in Cloud9 that use CSS
  • Reference: Adding CSS to HTML
  • Open-note CSS Quiz
  • Open-note CSS Quiz - Key
C.1.4Identify the characteristics of the following:
  • uniform resource identifier (URI)
  • URL.
C.1.5Describe the purpose of a URL.
C.1.6Describe how a domain name server functions.
C.1.7Identify the characteristics of:
  • internet protocol (IP)
  • transmission control protocol (TCP)
  • file transfer protocol (FTP).
C.1.8Outline the different components of a web page.
C.1.9Explain the importance of protocols and standards on the web.
A protocol is a set of rules and procedures that both sender and receiver must adhere to in order to allow coherent data transfer; without protocols, lossless data transfer cannot be established.

Standards such as HTML allow interoperability between different systems and components.

C.1.10Describe the different types of web page.
personal pages, blogs, search engine pages, forums, social media platforms, news pages, media sources, trading/e-commerce pages, customer service platforms, information pages of public authorities
C.1.11Explain the differences between a static web page and a dynamic web page.
• Static HTML web pages keep the same content and layout until the web designer changes them.
• Dynamic web pages, which make use of technologies such as PHP, ASP.NET or Java Servlets, change their appearance and content depending on user input.
C.1.12Explain the functions of a browser.
A web browser (commonly referred to as a browser) is a software application for retrieving, presenting, and traversing information resources on the World Wide Web.
C.1.13Evaluate the use of client-side scripting and server-side scripting in web pages.
A client-side script does not require access to a remote server, so any processing it does is done more quickly and uses less bandwidth; this also reduces the load on the server. Server-side scripts, on the other hand, can work with data and logic that must stay on the server (a small server-side sketch follows).
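
For contrast, a very small server-side sketch, assuming a Node.js runtime (the greeting idea, the port and the parameter name are invented for illustration). Here the processing happens on the server, so it costs bandwidth and server load but keeps the logic and data away from the client.

```javascript
// Minimal server-side sketch using Node's built-in http module.
const http = require("http");
const url = require("url");

const server = http.createServer((req, res) => {
  const query = url.parse(req.url, true).query;   // e.g. /greet?name=Ada
  const name = query.name || "world";
  // processing happens on the server; the client only ever sees the finished HTML
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end(`<p>Hello, ${name}!</p>`);
});

server.listen(8080);   // port chosen arbitrarily for this sketch
```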
C.1.14Describe how web pages can be connected to underlying data sources.
A web page can be connected to a database server (for example an SQL server), from which the web server can retrieve information that is to be displayed to the user. In IBCS year one we created a series of C programs that could read and write a series of files as part of a grade book project. A web front end was used to connect to a web server, and PHP programs executed by the web server would in turn call the C programs that could access the grade book files. (A minimal sketch of the same idea follows.)
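
A minimal sketch of that idea, assuming a Node.js server and a plain JSON file standing in for the underlying data source (the file name and its contents are invented):

```javascript
// Sketch: a web server that answers requests by reading an underlying data file.
// "grades.json" is a hypothetical file such as {"alice": 92, "bob": 85}.
const http = require("http");
const fs = require("fs");

http.createServer((req, res) => {
  const grades = JSON.parse(fs.readFileSync("grades.json", "utf8"));  // underlying data source
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify(grades));   // a real front end would format this for the user
}).listen(8080);
```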
C.1.15Describe the function of the common gateway interface (CGI).
CGI makes executable programs that are installed on a server available to a client: the web server runs the program and sends its output back as the response.
Perl was one of the first programming languages to be used for CGI programming. Web servers could execute Perl programs on the server and direct the program output back to the user. Perl can connect to databases, creating a "gateway" to data sources. (A minimal CGI-style sketch follows.)
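
A CGI program can be written in almost any language; the sketch below assumes the web server is set up to run it with Node.js. The server passes request details in environment variables such as QUERY_STRING, and whatever the program prints (headers, a blank line, then the body) is returned to the client.

```javascript
#!/usr/bin/env node
// Minimal CGI-style program: read the query string from the environment and
// write an HTTP header block followed by the response body to standard output.
const query = process.env.QUERY_STRING || "";

console.log("Content-Type: text/html");
console.log("");                                  // blank line separates headers from body
console.log("<html><body>");
console.log(`<p>You asked for: ${query}</p>`);
console.log("</body></html>");
```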
C.1.16Evaluate the structure of different types of web pages.

C.2 Searching the Web

Command Term | Level | Definition
Define | 1 | Give the precise meaning of a word, phrase, concept or physical quantity.
Describe | 2 | Give a detailed account.
Distinguish | 2 | Make clear the differences between two or more concepts or items.
Outline | 2 | Give a brief account or summary.
Discuss | 3 | Offer a considered and balanced review that includes a range of arguments, factors or hypotheses. Opinions or conclusions should be presented clearly and supported by appropriate evidence.
Explain | 3 | Give a detailed account including reasons or causes.
Suggest | 3 | Propose a solution, hypothesis or other possible answer.

Vocabulary for Searching the Web

C.2.1Define the term search engine.
A web search engine is a software system that is designed to search for information on the World Wide Web
C.2.2Distinguish between the surface web and the deep web.
  • The Surface Web is that portion of the World Wide Web that is readily available to the general public and searchable with standard web search engines.
  • The deep web consists of the parts of the World Wide Web whose contents are not indexed by standard search engines for any reason; it is the opposite of the surface web.
    It is much larger than the surface web: only a fraction of the data on the web is accessible by conventional means.
      The deep web includes:
    • dynamically generated pages (produced as a result of queries or by JavaScript, or downloaded from servers using AJAX/Flash)
    • password-protected pages (and subscriptions)
    • pages without any inlinks
C.2.3Outline the principles of searching algorithms used by search engines.
    Google's PageRank algorithm
  • each page is given a score (rank) for a particular search
  • the score determines how high up the list the page will appear
  • the score is primarily determined by the number (and importance) of inlinks
  • the value of an inlink from page A is proportional to PR(A)/C(A), where PR(A) is the PageRank of page A and C(A) is the number of outlinks from page A
  • values are calculated when pages are indexed

    Google's ranking also includes other factors such as:
  • the time that the page has existed
  • the frequency of the search keywords on the page
  • other unknown factors (the exact algorithm is not made public by Google)
  • PageRank is a link-analysis algorithm that assigns a numerical weighting to each element of a set of hyperlinked documents; PR(E) denotes the PageRank of page E. A hyperlink to a page counts as a vote of support for that page: importance by association. A page's rank is built from the ranks of the pages that link to it, each divided by the number of outgoing links on that linking page. Taken together, the PageRank values sum to 1, so they form a probability distribution. (A minimal iterative sketch follows this list.)
    Ref: wikibooks.org: IB CS Web_Science
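
A minimal iterative PageRank sketch in JavaScript; the three-page link graph and the damping factor of 0.85 are illustrative assumptions for this example, not exam content.

```javascript
// Iterative PageRank sketch over a tiny hand-made link graph.
const links = { A: ["B", "C"], B: ["C"], C: ["A"] };   // page -> pages it links out to
const pages = Object.keys(links);
const d = 0.85;                                        // assumed damping factor

let rank = {};
pages.forEach(p => (rank[p] = 1 / pages.length));      // start with equal ranks

for (let i = 0; i < 20; i++) {                         // repeat until the values settle
  const next = {};
  pages.forEach(p => (next[p] = (1 - d) / pages.length));
  pages.forEach(p => {
    const share = rank[p] / links[p].length;           // PR(p) / C(p)
    links[p].forEach(q => (next[q] += d * share));     // each outlink passes on a share
  });
  rank = next;
}

console.log(rank);   // the ranks sum to 1: a probability distribution over pages
```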

    HITS algorithm
  • has been superseded by PageRank
  • based upon hubs and authorities
  • a hub is a page that leads to many authoritative pages
  • an authority is a page that is linked to by many hubs
  • page ranking determined by the sum of the hub score and the authority score
  • authority score is the sum of the hub scores of each node pointing to it
  • hub score is the sum of authority scores of every node that it points to
  • The HITS algorithm is an iterative process that is executed at query time (therefore relatively slow)
  • HITS is a link-analysis algorithm that also rates web pages using hubs and authorities: a good hub points to many authoritative pages, and a good authority is a page linked to by many good hubs. Each page is assigned two scores: an authority score, which estimates the value of its content, and a hub score, which estimates the value of its links to other pages. The algorithm first generates a root set (the most relevant pages) using a text-based algorithm, then a base set by augmenting the root set with the pages linked from it or to it. The base set and all the hyperlinks among its pages form a focused subgraph upon which HITS is performed. (A small iteration sketch follows.)
    Ref: wikibooks.org: IB CS Web_Science
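
A small sketch of the hub/authority iteration on a made-up three-page graph; the normalisation step keeps the scores from growing without bound.

```javascript
// HITS iteration sketch (hypothetical pages A, B, C).
const outlinks = { A: ["B", "C"], B: ["C"], C: ["A", "B"] };
const pages = Object.keys(outlinks);

let hub = {}, auth = {};
pages.forEach(p => { hub[p] = 1; auth[p] = 1; });

for (let i = 0; i < 20; i++) {
  // authority score: sum of the hub scores of every page pointing to it
  const newAuth = {};
  pages.forEach(p => (newAuth[p] = 0));
  pages.forEach(p => outlinks[p].forEach(q => (newAuth[q] += hub[p])));

  // hub score: sum of the authority scores of every page it points to
  const newHub = {};
  pages.forEach(p => (newHub[p] = outlinks[p].reduce((s, q) => s + newAuth[q], 0)));

  // normalise so the scores stay bounded from one iteration to the next
  const normA = Math.sqrt(pages.reduce((s, p) => s + newAuth[p] ** 2, 0));
  const normH = Math.sqrt(pages.reduce((s, p) => s + newHub[p] ** 2, 0));
  pages.forEach(p => { auth[p] = newAuth[p] / normA; hub[p] = newHub[p] / normH; });
}

console.log({ hub, auth });   // good hubs and good authorities reinforce each other
```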

    Things to also know and understand
  • the consequential effects that a change in PageRank of one page will have on others and that the calculation of PageRanks is an iterative process.

    Things that you will not be asked on an IBCS Paper
  • mathematical examples - you will encounter math concepts in the algorithm reading assigned, and when we talk about graph theory.

C.2.4Describe how a web crawler functions.
A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. The "spider" checks for the standard filename robots.txt, addressed to it, before sending certain information back to be indexed depending on many factors, such as the titles, page content, JavaScript, Cascading Style Sheets (CSS), headings, as evidenced by the standard HTML markup of the informational content, or its metadata in HTML meta tags.
    Web Crawlers:
  • creates a copy of every web page that it visits (for later indexing by the search engine)
  • usually starts at a popular site
  • searches a page for links to other pages
  • follows these links and repeats process
  • initially looks for the file robots.txt for instructions on pages to ignore (duplicate content, irrelevant pages)
  • also used to retrieve email addresses (for spam)
  • also used by webmasters for checking the integrity of a site (it can find links that are no longer valid or files that are missing); a toy crawler sketch follows this list
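
A toy crawler sketch along these lines, assuming a recent Node.js runtime with a global fetch; a real crawler would first fetch and obey robots.txt and would use a proper HTML parser rather than a regular expression.

```javascript
// Toy web crawler sketch: start somewhere, copy pages, follow links, repeat.
async function crawl(startUrl, maxPages = 10) {
  const frontier = [startUrl];   // links waiting to be followed
  const visited = new Set();     // pages already copied for indexing

  while (frontier.length > 0 && visited.size < maxPages) {
    const url = frontier.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    try {
      const html = await (await fetch(url)).text();          // copy of the page for later indexing
      for (const match of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
        frontier.push(match[1]);                             // follow links and repeat the process
      }
    } catch (err) {
      // unreachable or non-HTML pages are simply skipped
    }
  }
  return [...visited];
}

crawl("https://example.org").then(pages => console.log(pages));
```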

    Activities
  • Synonyms: web robots, bots, web spiders
  • Introduction to web crawlers.
    Class Note form and example using wget to download java files.
      Wget notes
    • can access sites using http, https, and ftp protocols
    • supports connecting using a userid and password
    • supports identifying itself as a particular browser, which is useful for downloading browser-specific versions of a web site/web page (for example, a Firefox or Internet Explorer version)
    • is available from the Cloud9 environment we are using.
C.2.5Discuss the relationship between data in a meta-tag and how it is accessed by a web crawler.
Google says: "Currently we don't trust metadata because we are afraid of being manipulated". So meta tags can only be one source of information for indexing a web site; modern crawlers mostly rely on content-based algorithms.
    Some spiders pay more attention to words occurring in
  • titles
  • sub-titles
  • metatags
    while other spiders/indexes may index every word found on a page
      Meta tags
    • are inserted by web designer/owner
    • contain keywords and concepts (helps to clarify meaning)
    • description / title can be shown in the search results
    • noindex and nofollow in the 'robots' meta tag can instruct crawlers not to index pages or follow their links

      Things to also remember
    • keywords can be misleading

C.2.6Discuss the use of parallel web crawling.
The expansion of the web has led to new search engine initiatives which include parallelization of web crawlers.
    Parallel web crawlers are designed to:
  • maximize performance
  • minimise overheads
  • avoid duplication
  • communicate with each other (to avoid above)
  • can work on different geographical areas (see the sketch after this list)
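
The sketch below only hints at the idea: several crawl workers share a "visited" set so they do not duplicate each other's work. Real parallel crawlers run as separate processes or machines, often split by region or by site, and coordinate over the network (assumes Node.js with a global fetch; the URLs are invented).

```javascript
// Sketch of coordinated crawl workers sharing one "visited" set.
const visited = new Set();

async function worker(urls) {
  for (const url of urls) {
    if (visited.has(url)) continue;        // coordination: skip pages another worker has taken
    visited.add(url);
    await fetch(url).catch(() => {});      // fetch and, in a real crawler, index the page
  }
}

// each worker is handed a different slice of the web
Promise.all([
  worker(["https://example.org/a", "https://example.org/b"]),
  worker(["https://example.org/b", "https://example.org/c"]),
]).then(() => console.log([...visited]));  // /b is only fetched once
```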
C.2.7Outline the purpose of web-indexing in search engines.
  • Web indexing allows a search engine to quickly give the user search results based on each web page's metadata, content or other sources (a tiny index sketch follows this list).
  • Web-crawlers retrieve copies of each web page visited
  • Each page is inspected to determine its ranking for specific search terms.
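
A tiny inverted-index sketch showing why indexing makes lookups fast: the crawler's copies are processed once, and queries are then answered from the index instead of re-reading every page (the two pages and their text are invented).

```javascript
// Build an inverted index: every word maps to the set of pages it appears on.
const pages = {
  "page1.html": "lossless compression of audio files",
  "page2.html": "lossy compression reduces file size",
};

const index = {};
for (const [url, text] of Object.entries(pages)) {
  for (const word of text.toLowerCase().split(/\W+/)) {
    if (!word) continue;
    (index[word] = index[word] || new Set()).add(url);
  }
}

console.log([...index["compression"]]);   // -> [ 'page1.html', 'page2.html' ]
```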
C.2.8Suggest how web developers can create pages that appear more prominently in search engine results.
  • use good keywords in the content
  • get the page linked to from many source pages
  • Allow search engines to find your site - submit your web site for indexing to the search engines, make sure search engines have authorization to reach the pages you would like indexed
  • set the robots.txt file appropriately (a small example follows this list)
  • Have a link-worthy site - so other web sites will link to yours, making it more relevant and increasing your page rank
  • Identify key words, metadata
  • Ensure search-friendly architecture
  • Have quality content - and don't let it stagnate. Updating the content regularly will make it more timely.
  • Remove outdated material
  • See also C.2.11
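
Relating to the robots.txt point above, a small example file (the paths are invented). It sits at the root of the site and tells well-behaved crawlers which areas to skip and where to find the sitemap:

```
User-agent: *
Disallow: /drafts/
Disallow: /duplicate-content/
Sitemap: https://example.org/sitemap.xml
```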
C.2.9Describe the different metrics used by search engines.
  • Keyword rankings
  • Backlinks
  • Organic search traffic
  • Average time on-page
  • Pages per visitor
  • Trustworthiness of linking domain/hub
  • Popularity of linking page
  • Relevancy of content between source and target page
  • Anchor text used in link
  • Amount of links to the same page on source page
  • Amount of domains linking to target page
  • Relationship between source and target domains
  • Variations of anchor text in link to target page
    How do different search engines compare? Parameters to look at include:
  • recall (finding the relevant page in an index)
  • precision (ranking a page correctly)
  • relevance
  • coverage
  • customization
  • user experience
C.2.10Explain why the effectiveness of a search engine is determined by the assumptions made when developing it.
This is a topic worth thinking about. Think about the vocabulary for the topic "searching the web" and how that vocabulary applies to answering this question. Here are some things to consider:
  • What is a search engine?
  • How does a search engine work?
  • If you were creating a search engine, what are the assumptions that you would make?
  • Who are the targeted users of search engines?
  • Who are the content providers for search engines?
  • Do all content providers follow the "rules?" What are the rules?
  • As a search engine designer, what "rules" would you want your content providers to follow?
  • What would you do with content that doesn't follow the rules?
  • You need not write an essay (because the command term is Explain), but connect some of these ideas while addressing the question directly. Read the question again carefully and pick some key concepts to include in your answer.
C.2.11Discuss the use of white hat and black hat search engine optimization.
    Things to know
  • The difference between white hat search engine optimization and black hat
  • The degree of success achieved by either white hat or black hat optimization efforts

    White hat (links from C.2.8)
  • new sites can send XML site map to Google
  • include a robots.txt file
  • add site to Google's Webmaster Tools to warn you if site is uncrawlable
  • make sure the H1 tag contains your main keyword
  • page titles contain keywords
  • relevant keywords with each image
  • site has suitable keyword density (but no keyword stuffing)
  • White hat techniques are "within" guidelines and considered ethical - long-term return. Examples: guest blogging, link baiting, quality content, site optimization.
  • "In search engine optimization (SEO) terminology, white hat SEO refers to the usage of optimization strategies, techniques and tactics that focus on a human audience opposed to search engines and completely follows search engine rules and policies.

    For example, a website that is optimized for search engines, yet focuses on relevancy and organic ranking is considered to be optimized using White Hat SEO practices. Some examples of White Hat SEO techniques include using keywords and keyword analysis, backlinking, link building to improve link popularity, and writing content for human readers.

    White Hat SEO is more frequently used by those who intend to make a long-term investment on their website. Also called Ethical SEO."
    Ref: White Hat SEO

    Black-hat
  • hidden content
  • keyword stuffing
  • link farms
  • other tricks to get page rankings higher than they should be, or to get pages marked as hits when they may have nothing to do with a particular search.
  • Black hat techniques use aggressive SEO strategies that exploit search engines rather than focusing on a human audience - short-term return. They include blog spamming, parasite hosting and cloaking.

White hat search engine optimization means filling the web page with relevant data, whereas black hat search engine optimization stuffs the page with keywords that make barely any sense. To give a good product to the user, good page content should be the main tool for achieving a high search engine ranking.

C.2.12Outline future challenges to search engines as the web continues to grow.
Issues include error management and the lack of quality assurance of uploaded information. Since the number of web pages and the number of authors are increasing rapidly, it is becoming more and more important for search engines to filter out the information the user actually wants. Due to the growing amount of data on the World Wide Web, crawlers also have to be designed more efficiently.
    Areas being developed are:
  • concept-based searching
  • natural language queries (e.g Ask.Jeeves.com)
Review Materials: Outline of things to know - ReviewTopics-C-2-Searching-The-Web.pdf

Classwork/Homework: Searching the Web Review Questions

C.3 Distributed Approaches to the Web

Command Term | Level | Definition
Define | 1 | Give the precise meaning of a word, phrase, concept or physical quantity.
Describe | 2 | Give a detailed account.
Distinguish | 2 | Make clear the differences between two or more concepts or items.
Compare | 3 | Give an account of the similarities between two (or more) items or situations, referring to both (all) of them throughout.
Evaluate | 3 | Make an appraisal by weighing up the strengths and limitations.
Explain | 3 | Give a detailed account including reasons or causes.

C.3.1Define the terms: mobile computing, ubiquitous computing, peer-2-peer network, grid computing.
C.3.2Compare the major features of:
  • mobile computing
  • ubiquitous computing
  • peer-2-peer network
    Explain one advantage of the use of a peer-2-peer (P2P) network for obtaining and downloading music and movie files.
    • Easier to set up; Less time will need to be spent in configuring the network;
    • Other advantages could deal with the increased range of available files and the lower (or even zero) costs involved (depending upon the network).
  • grid computing.
C.3.3Distinguish between interoperability and open standards.
C.3.4 Describe the range of hardware used by distributed networks.
C.3.5 Explain why distributed systems may act as a catalyst to a greater decentralization of the web.
C.3.6Distinguish between lossless and lossy compression.
Discuss two factors that would affect the decision to use either lossless or lossy compression when transferring files across the Internet.

  • Lossless compression is used when loss of data is unacceptable when transferring files such as audio files (a sketch of one simple lossless scheme follows this list);
  • Lossy compression may not significantly affect the final version of the file when it is decompressed;
  • Lossy compression will reduce file size;
  • Reduced file size may be an important requirement such as in the use of MP3 music files;
  • Lossy compression results in faster file transfer; Which is important when Internet connections are slow or files are large;
  • If lossy compression is used the original file cannot be reinstated;
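
A sketch of one very simple lossless scheme, run-length encoding, to make the "nothing is lost" point concrete; the sample string is invented, and real schemes (ZIP, PNG, FLAC) are far more sophisticated.

```javascript
// Run-length encoding: a lossless scheme, so decoding restores the original exactly.
function rleEncode(s) {
  let out = "";
  let i = 0;
  while (i < s.length) {
    let run = 1;
    while (i + run < s.length && s[i + run] === s[i]) run++;   // count the repeated character
    out += run + s[i];
    i += run;
  }
  return out;
}

function rleDecode(s) {
  let out = "";
  for (const [, count, ch] of s.matchAll(/(\d+)(\D)/g)) {
    out += ch.repeat(Number(count));
  }
  return out;
}

const original = "aaaabbbcc";
const packed = rleEncode(original);                    // "4a3b2c" - smaller than the original
console.log(packed, rleDecode(packed) === original);   // true: nothing was lost
```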
C.3.7Evaluate the use of decompression software in the transfer of information.

C.4 The Evolving Web

Command Term | Level | Definition
Describe | 2 | Give a detailed account.
Discuss | 3 | Offer a considered and balanced review that includes a range of arguments, factors or hypotheses. Opinions or conclusions should be presented clearly and supported by appropriate evidence.
Explain | 3 | Give a detailed account including reasons or causes.

C.4.1Discuss how the web has supported new methods of online interaction such as social networking.
C.4.2Describe how cloud computing is different from a client-server architecture.
Define the term Private Cloud:
Cloud computing services that are provided for a particular group with a limited number of users;
C.4.3Discuss the effects of the use of cloud computing for specified organizations.
C.4.4Discuss the management of issues such as copyright and intellectual property on the web.
C.4.5Describe the interrelationship between privacy, identification and authentication.
C.4.6Describe the role of network architecture, protocols and standards in the future development of the web.
C.4.7Explain why the web may be creating unregulated monopolies.
C.4.8Discuss the effects of a decentralized and democratic web.

HL Extension C.5 Analysing the Web

Command Term | Level | Definition
Describe | 2 | Give a detailed account.
Outline | 2 | Give a brief account or summary.
Discuss | 3 | Offer a considered and balanced review that includes a range of arguments, factors or hypotheses. Opinions or conclusions should be presented clearly and supported by appropriate evidence.
Explain | 3 | Give a detailed account including reasons or causes.

C.5.1Describe how the web can be represented as a directed graph.
C.5.2Outline the difference between the web graph and sub-graphs.
C.5.3Describe the main features of the web graph such as bowtie structure, strongly connected core (SCC), diameter.
C.5.4Explain the role of graph theory in determining the connectivity of the web.
C.5.5Explain that search engines and web crawling use the web graph to access information.
C.5.6Discuss whether power laws are appropriate to predict the development of the web.

HL Extension C.6 The Intelligent Web

Command Term | Level | Definition
Define | 1 | Give the precise meaning of a word, phrase, concept or physical quantity.
Describe | 2 | Give a detailed account.
Distinguish | 2 | Make clear the differences between two or more concepts or items.
Discuss | 3 | Offer a considered and balanced review that includes a range of arguments, factors or hypotheses. Opinions or conclusions should be presented clearly and supported by appropriate evidence.
Evaluate | 3 | Make an appraisal by weighing up the strengths and limitations.
Explain | 3 | Give a detailed account including reasons or causes.

C.6.1Define the term semantic web.
A "web of data" that can be read and analysed by machines. Students should appreciate the difference between this and a "web of documents" which would describe the present state of the web (pre-Semantic Web).
C.6.2Distinguish between the text-web and the multimedia-web.
This is part of the evolution of the web.

Describe some of the tools that allow the use of multimedia.

C.6.3Describe the aims of the semantic web.
  • The web should become the ultimate (machine-readable) database.
  • The facility to link data across different enterprises.
  • The web should become a highly collaborative medium.
  • Common vocabularies and methods for handling and querying data need to be developed and agreed upon.
  • Students should explore the above and also understand the principal features of the RDF model (a toy illustration follows this list).
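
A toy illustration of the RDF idea in JavaScript: data expressed as subject-predicate-object triples that a machine can query and combine. The URIs are invented; dc: and foaf: stand for the Dublin Core and FOAF vocabularies commonly used on the semantic web.

```javascript
// The web as machine-readable data: a list of subject-predicate-object triples.
const triples = [
  ["http://example.org/book/1",    "dc:title",   "Web Science Notes"],
  ["http://example.org/book/1",    "dc:creator", "http://example.org/person/42"],
  ["http://example.org/person/42", "foaf:name",  "A. Author"],
];

// because the data is structured, a program can follow the links between facts
const creatorsOf = subject =>
  triples.filter(t => t[0] === subject && t[1] === "dc:creator").map(t => t[2]);

console.log(creatorsOf("http://example.org/book/1"));   // -> [ 'http://example.org/person/42' ]
```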
C.6.4Distinguish between an ontology and folksonomy.
An ontology is a standardised vocabulary (one that avoids ambiguities) for use on the web, allowing data from different enterprises to be usefully combined. It includes the use of relationships (e.g. creator = author).

A folksonomy is a more informal ontology that has evolved through the use of tags posted by ordinary users.

Students should look at specific examples of each (e.g. ontologies: DBPedia/ incorporating book data from different booksellers; folksonomy: use in photo albums, blogs, delicious.com).

C.6.5Describe how folksonomies and emergent social structures are changing the web.
Following on from above, students should discuss how sites that allow users to tag the elements on those sites make the data more accessible. Sites to look at include:
  • Technorati
  • Delicious
  • Flickr
  • MetaFilter

Are these new structures a viable alternative to search engines?

C.6.6Explain why there needs to be a balance between expressivity and usability on the semantic web.
Discuss the balance between creating web pages for the benefit of people or for the benefit of machines.
C.6.7Evaluate methods of searching for information on the web.
Can YouTube be classed as a search engine? Google’s Panda puts the focus on quality. Cloud Kite (Open Drive) for searching the cloud. Multimedia search engines (visual / audio).
C.6.8Distinguish between ambient intelligence and collective intelligence.
Ambient intelligence collects and processes data from the physical surroundings in order to provide a unique user experience.

Collective intelligence collects and processes data about a particular topic from around the web.

C.6.9Discuss how ambient intelligence can be used to support people.
Be able to discuss different examples looking at both positive and negative consequences. The discussion should include the technology needed for this, such as nanotechnology, biometrics, sensors etc.
C.6.10Explain how collective intelligence can be applied to complex issues.
Research examples such as climate change, social bookmarking, and stock market fluctuations.