Readers: 3 | Updated: 06-04

How Search Engines Can Index Pages in Parts

Web pages can contain a lot of information about various types of objects such as products, people, papers, organizations, and so on. Information about those objects may be spread out on different pages, at different sites.

For example, a page may host a product review of a particular model of camera, and another page may present an ad offering to sell that model of camera at a certain price.

One page might display a journal article, and another page could be the homepage for the author of that article.

Someone searching for information about the camera, or about the author, may need information contained in both pages. They may have to use a search engine to locate multiple pages to find the information that they need.

If there were a way for a search engine to automatically identify when information on different web pages relates to the same object, that might be helpful to searchers in a number of ways.

Product search could help shoppers find information about, and prices for, goods that they might want to purchase. A repository of scientific papers could provide additional information about the authors of those papers.

Extracting Object Blocks from Pages

A patent granted to Microsoft today explores the concept of extracting object blocks from pages, so that information that appears on pages about specific objects can be grouped together.

Often, information about a specific object is presented on a page with information about other objects. The Microsoft process would break information from a page into different blocks relating to the different objects found on that page.

It might look at visual features of a page, such as different font sizes and separating lines, to help identify object elements. It may search for elements within each object block that show that the block involves a particular object.
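To make the idea of splitting on visual cues concrete, here is a minimal sketch. It is not the method from the patent. It assumes each rendered line of a page has already been reduced to its text, a font size, and a flag for separating lines, and the names `Line`, `split_into_blocks`, and `heading_size` are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    font_size: int
    is_separator: bool = False  # e.g. a horizontal rule drawn on the page

def split_into_blocks(lines, heading_size=18):
    """Start a new block at each separator line or large-font heading."""
    blocks, current = [], []
    for line in lines:
        starts_block = line.is_separator or line.font_size >= heading_size
        if starts_block and current:
            blocks.append(current)
            current = []
        if not line.is_separator:
            current.append(line.text)
    if current:
        blocks.append(current)
    return blocks

# Illustrative page: a camera ad followed by a paper listing.
page = [
    Line("Sony camera on sale", 20),
    Line("Price: $599", 12),
    Line("", 0, is_separator=True),
    Line("A paper on page segmentation", 20),
    Line("Author: A. Author", 12),
]
blocks = split_into_blocks(page)
```

A real system would likely weigh many more cues, such as position, color, and the HTML structure, but the shape of the output is the same: one list of elements per candidate object block.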

After a page is broken into object blocks, the process might attempt to label the elements of each object. Information about the objects may then be stored in a database of objects, along with blocks from other pages that involve the same objects.

Method and system for identifying object information
Invented by Ji-Rong Wen, Wei-Ying Ma, Zaiqing Nie
Assigned to Microsoft
US Patent 7,383,254
Granted June 3, 2008
Filed April 13, 2005

Abstract

A method and system for identifying object information of an information page is provided. An information extraction system identifies the object blocks of an information page. The extraction system classifies the object blocks into object types.

Each object type has associated attributes that define a schema for the information of the object type. The extraction system identifies object elements within an object block that may represent an attribute value for the object.

After the object elements are identified, the extraction system attempts to identify which object elements correspond to which attributes of the object type in a process referred to as “labeling.” The extraction system uses an algorithm to determine the confidence that a certain object element corresponds to a certain attribute.

The extraction system then selects the set of labels with the highest confidence as being the labels for the object elements.
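The abstract does not say how the confidence values are computed, so the sketch below only illustrates the final step: picking the set of labels with the highest total confidence. The scores, the function name `best_labeling`, and the brute-force search over permutations are all assumptions for illustration, not the patented algorithm.

```python
from itertools import permutations

def best_labeling(elements, attributes, confidence):
    """Return the element-to-attribute assignment with the highest total confidence.

    confidence maps (element, attribute) pairs to a score; missing pairs score 0.
    Assumes there are at least as many attributes as elements.
    """
    best, best_score = None, float("-inf")
    # Try every way of giving each element a distinct attribute.
    for attrs in permutations(attributes, len(elements)):
        score = sum(confidence.get((e, a), 0.0) for e, a in zip(elements, attrs))
        if score > best_score:
            best, best_score = dict(zip(elements, attrs)), score
    return best, best_score

elements = ["Sony", "$599"]
attributes = ["manufacturer", "model", "price"]
# Made-up confidence scores for the camera example.
confidence = {
    ("Sony", "manufacturer"): 0.9,
    ("Sony", "model"): 0.3,
    ("$599", "price"): 0.95,
    ("$599", "model"): 0.1,
}
labels, score = best_labeling(elements, attributes, confidence)
```

Scoring the labels jointly, rather than one element at a time, keeps two elements from claiming the same attribute. Brute force is only practical for a handful of elements; a production system would need something more efficient.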

Identifying object information on a web page

An advertisement on a web page for a camera may be an object block, and the matching object would be the specific model of camera.

An object block that advertises a camera may be classified as a product type, and an object block relating to a journal paper may be classified as a paper type.

Each object type has associated attributes.

A product object type may have attributes of:

Manufacturer,
Model,
Price,
Description, and
so on.

A paper object type may have attributes of:

Title,
Author,
Publisher, and
so on.

When a page is broken into blocks, an information extraction system might attempt to associate attributes of an object with values from the block.

So, for a camera, “Sony” might be identified as a manufacturer attribute and “$599” as a price attribute.
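One simple way to picture object types and their attributes is as a small schema table with records built against it. The sketch below is an assumption about how that could be represented, not something described in the patent. It uses only the attributes named above, and `SCHEMAS` and `make_object` are hypothetical names.

```python
# Attributes named in the examples above; real schemas would have more.
SCHEMAS = {
    "product": ("manufacturer", "model", "price", "description"),
    "paper": ("title", "author", "publisher"),
}

def make_object(object_type, **values):
    """Build an object record, rejecting attributes outside the type's schema."""
    schema = SCHEMAS[object_type]
    unknown = set(values) - set(schema)
    if unknown:
        raise ValueError(f"{sorted(unknown)} not in {object_type} schema")
    # Attributes the block did not supply stay None.
    return {"type": object_type, **{attr: values.get(attr) for attr in schema}}

camera = make_object("product", manufacturer="Sony", price="$599")
```

Here the camera block supplied a manufacturer and a price, so the model and description stay empty until another block fills them in.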

Blocks and Objects

About a month ago, I wrote about how Microsoft described breaking pages down into parts and deciding which parts of those pages were the most important, in Search Engines, Web Page Segmentation, and the Most Important Block.

Microsoft is exploring indexing information at an information block level, instead of a page level. This patent touches upon the idea of breaking apart the content of a page based upon the HTML code of the page and the visual aspects of how information on a page is presented, which it could do using VIPS, a Vision-based Page Segmentation Algorithm.

The ideas in this newly granted patent take that concept a step further. They involve examining different segments of pages for information that relates to different objects, extracting information from those segments about the attributes of those objects, and bringing information about the same object together in one data store, so that it can be accessed by people.
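As a rough illustration of relating blocks from different pages in one data store, the sketch below merges records that appear to describe the same product. The identity rule (same type, manufacturer, and model, ignoring case), the "first value seen wins" merge, the model name, and the example.com URLs are all made up for this example. The patent summary above does not specify how matching is done.

```python
from collections import defaultdict

def object_key(record):
    """Hypothetical identity rule: same type, manufacturer and model, ignoring case."""
    return (record["type"], record["manufacturer"].lower(), record["model"].lower())

def build_store(records):
    """Group extracted blocks by object and merge their attribute values."""
    store = defaultdict(lambda: {"attributes": {}, "sources": []})
    for record in records:
        entry = store[object_key(record)]
        entry["sources"].append(record["url"])
        for name, value in record.items():
            if name != "url" and value is not None:
                entry["attributes"].setdefault(name, value)  # first value seen wins
    return dict(store)

# Two blocks from different sites: a review and an ad for the same camera.
records = [
    {"url": "https://example.com/review", "type": "product",
     "manufacturer": "Sony", "model": "X100", "price": None,
     "description": "Review text"},
    {"url": "https://example.org/ad", "type": "product",
     "manufacturer": "sony", "model": "x100", "price": "$599",
     "description": None},
]
store = build_store(records)
```

A searcher looking up the camera would then get the review text and the price from a single entry, with both source pages listed.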

There are a few white papers from Microsoft that explore these ideas more fully:

Microsoft has produced a couple of examples where they are clearly using this kind of object level searching at:

In 2006, Google described the use of visual segmentation of content on pages, and extraction of information from different segments, within the context of grabbing information from pages offering multiple reviews for local search, which I wrote about in Google and Document Segmentation Indexing for Local Search.

Yahoo also recently published a patent application that broke pages down into parts, and attempted to identify the most important parts of the page. More on that at: The Importance of Page Layout in SEO.

If you publish web pages, how might the search engines be breaking apart the content of your pages?


Copyright © 2008 SEO by the SEA.

