Google Says ...: Google shares the love...

The more I read about Google's activities, the more impressed I become with TSETSB. Still, there are some things Google does that I don't approve of.

Two brief rants before we get to the Google raves

For example, I will never forgive them for Web Accelerator, which I continue to block from my network at the server level. Web Accelerator eats up bandwidth and Google has yet to offer an effective means of compensating Webmasters for needlessly wasted bandwidth. In Google's defense, I will say that their technology only incorporates an incredibly stupid standard that was proposed by people who should have known better than to come up with such a dumb idea. Maybe that's why it's so popular, I don't know. But I fear the day is coming when I'll be blocking all Firefox users from Xenite.Org. Unless they want to pay for subscriptions to our content.

I also complained about Google's removing the fetch date from their cache. I get so many hits from Googlebot in my server logs that figuring out when they actually fetched a file is not easy for me. And when other people ask me to do research on their sites, I am now almost completely blinded in one eye thanks to this ridiculous "improvement" from Google.

Google snuggles with the guv'mint

But enough whining and moaning. There are a lot of great things Google has done, is doing, that I can talk about. For example, I notice they are snuggling up with the U.S. Government these days. Adam Lasnik is teaching a class on search optimization to government Web designers at Washington University. And Google (Enterprise) has been named most influential commercial company providing technology to the Federal IT market. It was only a matter of time, I suppose.

But the ever impressive and highly innovative Google Book Search people (who, theoretically, should be on my list of demons because I have published books) have now announced that DIANE Publications has made its entire inventory of reprinted government publications available on Google Book Search. You know, we taxpayers paid for all that data collection and reporting, so it's about time we get access to it. This is actually a public service from Google that should help historians and people who are curious about what sort of publications the government has spent their money on in the past.

Lesson for Business: Share what you do!

What can smaller businesses learn from Google Book Search and Google Enterprise? I'd say that if you have partnerships with larger entities where your services or products are playing a significant role, you should be writing about those relationships on your corporate blogs. Put feature articles on your Web site. Mention your hallmark accomplishments in your company history page. Tell people what you are doing for others, so they can get an idea of what you may be able to do for them.

After hours with Googlers, innovation, and invention

Innovation, of course, doesn't have to come from the corporate production process. One Googler offers a tip for organizing temporarily necessary cell phone numbers. Leave it to someone associated with search to think of prefixing names in a phone list.

Another Googler provides a fantastic report on Maker Faire, where innovation comes to life. The report will take a week for anyone to evaluate, but it's loaded with details, pictures, and video. Oh, my!

Significant revelations from Google

But now we're getting down to today's good stuff.

Google revives Tesseract OCR

First up, Google Code recently announced that Google had revived Hewlett-Packard's OCR technology (HP retired Tesseract in the mid-1990s). Think this is how Google has been scanning all those books? It doesn't matter, because as I pondered over the meaning of this post for the umpteenth time, today it hit me: Google may eventually be able to read all those graphics people use on their front pages. You know what I mean: the huge image files that say, "Michael Martinez is the best SEO in the world and you really should be paying him to help you rank at Google".

How many SEOs have complained about having to work around those Greeting Card images? Well, prognosticating what Google will do with its technology is not very productive, but if they are not thinking about how to scan Web greeting images and masthead graphics, they should be. Because there are just too many people who don't understand that a search engine cannot index the text embedded in a .GIF or .JPG.

Vanessa explains SiteLinks

The Webmaster Central Blog explains one of those curious SERP features that have puzzled, bemused, and bedazzled SEOs for years: SiteLinks (love the name, btw). SiteLinks are those sets of tightly compacted deep links that occasionally are included in a site's listing.

Google provides four levels of recognition for a Web site:

1. A simple listing for a single page in relation to the user's query.

2. Two listed pages, one indented under the first, in relation to the user's query.

3. Two listed pages as above, but with an additional tag offering "More pages URL".

4. One or two listed pages as above, but with a compact list of SiteLinks providing quick access to deeper content.

SEOs have lusted after those impressive SiteLinks results ever since they first started appearing. My most important site, Xenite.Org, has so far only achieved level three recognition despite many deep links and deep referrals. A lot of my pages come up in Google SERPs. But it takes more than what I've got so far to hit level four recognition.

NOTE: Some people might argue that having pictures from your site featured above search results, such as for Lucy Lawless, is a fifth level of recognition.

In any event, Google says that SiteLinks are completely automated. Maybe they are, but if any SEOs can figure out how to trigger their generation in SERPs, I think those SEOs will make even more money than before. Frankly, I haven't really tried to figure out the process.

And now, for the gold: Sharding

Do you know what shards are? I only have the vaguest idea, myself. I've watched a number of videos of Googlers making presentations. I've read some technical stuff. But I've never seen a shard in action. Google's database is so large it cannot all be contained on one server. Google reportedly uses up to 1,000 PCs to resolve any query. The database is spread out across some or all of those PCs in what Google calls "shards".

Last month, a Googler went to a BarCamp and made a presentation called Scaling Data On The Cheap. Yup, he talked about shards.

Slide shows don't tell you a great deal when you cannot hear what the speaker has to say. But we can infer a few (possibly very incorrect) ideas from the slides. For example, it appears from one slide that a table could be replicated in multiple shards, split across multiple shards, or comprise a single shard by itself.
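To make those three arrangements concrete, here is a minimal Python sketch. It is not taken from the slides: the function names, the shard count, and the use of a CRC32 hash to pick a shard are all assumptions chosen for illustration, and real systems are likely to differ.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example


def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a row key to a shard number using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def place_split(rows: dict, num_shards: int = NUM_SHARDS) -> list:
    """Split one table across shards: each row lives on exactly one shard."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in rows.items():
        shards[shard_for_key(key, num_shards)][key] = value
    return shards


def place_replicated(rows: dict, num_shards: int = NUM_SHARDS) -> list:
    """Replicate one table: every shard holds a full copy."""
    return [dict(rows) for _ in range(num_shards)]


def place_single(rows: dict) -> list:
    """Keep the whole table in a single shard by itself."""
    return [dict(rows)]
```

Splitting spreads the storage and the write load, replication spreads the read load at the cost of extra copies, and a single shard is the simplest choice for a table small enough to fit in one place.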

Google's original architecture (most likely no longer in use, at least going back to the January 2006 Big Daddy update, if not earlier) used many tables. There would have to be one or more master tables just to tell the various programs where all the other tables are. The paper describing that architecture says they had identified about 14 million words. Each word would have to have its own index. Rare words (occurring in the fewest documents) would have the smallest tables.
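The idea of one index per word can be sketched as a toy inverted index. This is only an illustration under my own assumptions (whitespace tokenizing, in-memory dictionaries, hypothetical function names `build_index` and `search`), not a description of how Google stores anything.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Build a map from each word to a sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index: dict, query: str) -> list:
    """Return ids of documents that contain every word in the query.

    Starting from the rarest word (the shortest list) keeps the
    intermediate result as small as possible.
    """
    lists = [index.get(word, []) for word in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)
    result = set(lists[0])
    for posting in lists[1:]:
        result &= set(posting)
    return sorted(result)
```

In this sketch a word that appears in only one document gets a one-entry list, while a common word such as "the" gets a long one, which is the size difference between rare-word and common-word tables described above.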

I can envision some programmatic advantages to replicating rare word tables across multiple shards, pairing some rare words with others in specific shards. And obviously large tables for the most common words would probably have to be split across multiple shards.

There is probably only minimal value to an SEO in knowing and understanding how shards actually work, but the slide show implies a great deal of redundancy has been built into Google's system architecture. It's like they have a lot of floppy database thingees that they lay partially across each other like blankets.

Well, it's food for thought, but I've already spent too much time on this post.

2 Comments:

Ryan said...: Hi Michael! Just so you know, that sharding presentation describes how we scale traditional databases, like MySQL, to handle apps like AdWords. It's designed primarily for session-based, write-heavy, interactive and transactional usage.

The search engine itself is built on different stuff. RDBMSes wouldn't be appropriate for it.; 10:18 PM
Michael Martinez said...: I appreciate the explanation and clarification, Ryan. Always a bit of a disappointment to learn when I've pointed my algorithm hounds in the wrong direction, but I do get to see so much of the technical world that way anyway. :); 10:44 PM

Thursday, September 07, 2006