Google搜索引擎的工作原理

搜索引擎之于Google,就像Windows之于微软,搜索引擎是互联网巨头Goolge的王牌,而今搜索对互联网来说太重要了,互联网搜索甚至已经提升到了搜索文化的高度,在海量信息的因特网上,搜索引擎是网民的罗盘,而Google做出了Internet上最好的指南针。

PPCblog.com呈现给我们一幅由Jess Bachman(在WallStats.com工作)精心描绘的示意图,这张流程图展示了每天拥有3亿次点击量的Google搜索按钮背后搜索引擎在那不到1秒的响应时间内所进行的处理。

Google(graphic) - How Google Works

这是我刚付印的最新示意图,这张流程图演示了在你点击Google搜索按钮后,在Google返回查询结果前那一眨眼的功夫里,Google是如何处理你的搜索请求的?这可是搜索巨人Google年赢利额高达200亿美元的杀手级应用,也是Internet首屈一指的商业和技术神话,大家肯定都想知道Google这棵摇钱树背后的秘密。

Google官方对其搜索技术的叙述

我们搜索技术的后端软件会在服务器侧触发一系列执行时间不到1秒的并行计算,Google问世前的传统搜索引擎的搜索结果严重依赖于关键词在页面上出现的频度,我们使用了200多个指标信号(其中包括我们拥有专利的PageRank页面等级加权算法)用来检查万维网的链接结构(佩奇和布林最初的想法是把万维网的链接结构用图论的有向无环图来建模)并决定网页的重要程度,我们假定一个网页的重要程度取决于别的页面对它的引用,就像学术论文中的引用指数一样,重要的论文总是会被很多其他论文引用。然后我们再根据搜索条件进行超文本匹配分析(对bot抓取的页面内容进行关键词倒排索引检索)确定跟搜索请求最相关的网页。综合最重要的网页和跟搜索请求最相关的网页两个方面,我们就能按重要程度和用户搜索请求相关程度把查询结果排序后呈现给我们的用户。

数据中心:Google用来索引世界的塔

Google的数据中心高度机密,我们能了解到的不多:

1. 在美国本土有19个以上的数据中心,其余17个数据中心分布在美国以外的世界各地。

2. 每个数据中心有50万平方英尺那么大,建造一个数据中心要花费约6亿美元。

3. Google数据中心是世界上最高效的设施之一,而且也非常环保,几乎没有碳排放。

4. 数据中心使用50到100兆瓦的电力,由于需要冷却,通常建在便于用水的地方。

5. Google服务器安置在一个一组容得下1160台服务器的有房子那么大的标准集装箱容器中。

处理流程

1.你写博客、或在Twitter上推微博、更新站点等诸如此类往Web上添加内容的操作

2.Google bots程序(一种作为搜索引擎构件的智能代理程序)抓取你网页的title和description、keyword等内容

(1)Google爬虫沿着链接路径周游万维网,如果没有超文本路径到你的站点,你的站点将不会被索引

(2)如果你在robots.txt中设置不许索引,Google爬虫程序将不会抓取你的网页

(3)如果链接到你站点的超文本链接上有nofollow标签,Google爬虫将不会从这些链接路径周游到你的站点。

(4)Google也能通过blog软件或xml站点地图找到你的网站

(5)从PageRank越高的网站链接到你的网站的链接越多,你的网站的PageRank就越高。

(6)Google爬虫将周游所有未标注为nofollow的链接

3.一旦被Google爬虫访问到,网页几秒内就被索引了

(1)网页内容被存储在一个倒排索引中

① 网页标题和链接数据被保存在一个索引中,用于广度优先搜索

② 网页内容保存在另一个索引中,以用于检索频率不高的长尾、个性化、深度优先搜索

(2)当你用Google搜索时,你并没有在检索时时更新的万维网,而是在检索Google的缓存,Google定期更新其索引库,在Twitter实时搜索等的竞争下,Google的索引库更新周期趋短。

4.Google基于链接评估域名和网页的总体PageRank值。

5.检查网页以防止作弊行为

(1) Google的搜索质量和反垃圾信息审查和优化算法

(2) 1万多远程测试用户评价搜索结果的质量

(3) Google征请用户对有PageRank讹诈嫌疑的垃圾信息进行举报

(4) Google接到 (美国)数字千年版权法案的通知,要求Google从搜索结果中剔除涉嫌盗版的内容

6.在对页面做了损害分析后,现在每个页面都有很多用于辅助用户搜索的数据片(比如检索关键词)反向引用着它

7.用户发出搜索请求

(1)Google搜索质量工程师Patrick Riley:在大多数Google搜索中,你的搜索处于许多并行的控制过程或Google实验室的创新项目组过程中,可以说每一个查询请求都会参与一些Google的创意实验。

8.Google会用同义词匹配与你的搜索关键词语义相近的查询结果

9.生成初步的查询结果

(1)Google当然能返回成千上万数量无限的查询结果,但一般只显示不到1000条的查询结果,出于“少则得,多则惑”的考虑。

(2)对查询结果做本地化处理,本土站点在查询结果中优先出现

10.对查询结果集按权威性和PageRank进行排序,重复的查询结果被剔除。

(1) Google根据关键词、广告类型、用户所处位置找出相关的被竞价拍卖的关键词广告

(2) 关键词广告必须遵守当地法律条文

① 广告业主的非法广告将被取缔

② 如果关键词的搜索流量过低或关键词广告点击量偏低,则会被自动禁用

③ 出于商业策略,像亚马逊这样的客户会给予优惠折扣。

(3) 关键词相关广告按收益潜力(对关键词进行竞价拍卖后的广告质量不断进行评估)排序

(4) 对广告业主来说广告内容一般都是固定的,但有时使用动态关键词使关键词广告与搜索关键词相关度更高

① 一些广告本身允许增加易变的附属信息,比如网站链接、电话号码、产品链接、地址等

(5) 当广告拥有了相当高的点击率,则会显示在搜索结果列表的上方,以使其更显眼。

(6) 其余的广告依序显示在相应的位置

11.对查询结果进行过滤处理

(1) 对通常的查询(比如在Google首页上发出的搜索请求),Google会把相关的专题性垂直搜索结果(比如新闻、购物、视频、书籍、地图等)也加到返回的查询结果中

(2) 个性化方面:用户访问过的网站在查询结果列表中会更靠上

(3) 大量使用锚点的网站有可能被从查询结果中删除

(4) 搜索结果集的聚簇性:如果网页被其他高PageRank的网站引用,则网页的重要性会大大提高。

(5) 趋势分析:对搜索流量爆增或有大量新闻的搜索关键词,Google会在新的查询结果中增加额外的PageRank权值。(Google有反映关键词搜索流量的Google趋势专题页面)

(6) 同一个域名下的多个网页如果具有相同的PageRank会被归为一组。

  1. 最终返回给浏览器端的用户一个人性化的、布局良好的、查询结果和广告泾渭分明的有机查询结果页面。

所有这些步骤在总共不到1秒的响应时间内完成,每天3亿次的点击量给Google带来了超过200亿美元的年收入。


honest Translate的身份于2010年发表于 译言网 Google搜索引擎的工作原理,现在GitHub上留作备份,翻译不当之处欢迎指正

原文地址:Google(graphic) - How Google Works

arXiv papers

DSOS and SDSOS Optimization: More Tractable Alternatives to Sum of Squares and Semidefinite Optimization

Amir Ali Ahmadi, Anirudha Majumdar

https://arxiv.org/abs/1706.02586

In recent years, optimization theory has been greatly impacted by the advent of sum of squares (SOS) optimization. The reliance of this technique on large-scale semidefinite programs however, has limited the scale of problems to which it can be applied. In this paper, we introduce DSOS and SDSOS optimization as more tractable alternatives to sum of squares optimization that rely instead on linear and second order cone programs respectively. These are optimization problems over certain subsets of sum of squares polynomials (or equivalently subsets of positive semidefinite matrices), which can be of interest in general applications of semidefinite programming where scalability is a limitation. We show that some basic theorems from SOS optimization which rely on results from real algebraic geometry are still valid for DSOS and SDSOS optimization. Furthermore, we show with numerical experiments from diverse application areas—polynomial optimization, statistics and machine learning, derivative pricing, and control theory—that with reasonable tradeoffs in accuracy, we can handle problems at scales that are currently far beyond the reach of sum of squares approaches. Finally, we provide a review of recent techniques that bridge the gap between our DSOS/SDSOS approach and the SOS approach at the expense of additional running time. The appendix of the paper introduces an accompanying MATLAB package for DSOS and SDSOS optimization.

https://www.solidot.org/story?sid=56606

A Practical \( O(R\log\log n+n) \) time Algorithm for Computing the Longest Common Subsequence

Daxin Zhu, Lei Wang, Yingjie Wu and Xiaodong Wang

https://arxiv.org/abs/1508.05553

In this paper, we revisit the much studied LCS problem for two given sequences. Based on the algorithm of Iliopoulos and Rahman for solving the LCS problem, we have suggested 3 new improved algorithms. We first reformulate the problem in a very succinct form. The problem LCS is abstracted to an abstract data type DS on an ordered positive integer set with a special operation Update(S,x). For the two input sequences X and Y of equal length n, the first improved algorithm uses a van Emde Boas tree for DS and its time and space complexities are \( O(R\log\log n+n) \) and O(R), where R is the number of matched pairs of the two input sequences. The second algorithm uses a balanced binary search tree for DS and its time and space complexities are \( O(R\log L+n) \) and O(R), where L is the length of the longest common subsequence of X and Y. The third algorithm uses an ordered vector for DS and its time and space complexities are O(nL) and O(R).

React技术栈相关

适用

  • If you plan to build a large scale app, go with React.
  • If you want a library that is adaptable for both web and native apps, go with React.
  • If you want the biggest ecosystem, go with React.

Vue作者尤雨溪(Evan You)论React

  • 要区分两个概念:React 本身和 React 生态圈所推崇的主流应用架构
  • React 本身其实还算简单的。最简单的理解,一个组件的渲染函数就是一个基于 state 和 props 的纯函数,state 是自己的,props 是外面来的,任何东西变了就重新渲染一遍,是不是很简单?
  • Flux/Redux 的繁琐,本质上是针对大型应用的复杂度所作出的权衡:用繁琐一些的 API,换长线的可维护性。
  • MobX 是适合中小规模应用的状态解决方案,然而用 React + MobX 本质上就是一个更繁琐的 Vue

Redux – 状态管理方案

Redux 的设计思想

  • Web 应用是一个状态机,视图与状态是一一对应的。
  • 所有的状态,保存在一个对象里面。

Spring Boot

Spring Boot:构建 Spring 应用程序的现代方式

Spring Boot 基础


Spring Boot的文档
https://docs.spring.io/spring-boot/docs/current/reference/

Building an Application with Spring Boot
https://spring.io/guides/gs/spring-boot/

Building Java Projects with Maven
https://spring.io/guides/gs/maven/

Building Java Projects with Gradle
https://spring.io/guides/gs/gradle/

Working a Getting Started guide with STS
https://spring.io/guides/gs/sts/

Spring Tool Suite
https://spring.io/tools/sts/all


深入学习微框架:Spring Boot
http://www.infoq.com/cn/articles/microframeworks1-spring-boot


From Zero to Hero with Spring Boot

Brian Clozel shows how Spring Boot can help build web applications, tests to production-ready features, that leverage the Spring ecosystem.

https://www.infoq.com/presentations/spring-boot-web-dev

https://github.com/bclozel/issues-dashboard/


Getting Started with Microservices in SpringBoot
https://www.infoq.com/articles/Microservices-SpringBoot

Exploring Micro-frameworks: Spring Boot
https://www.infoq.com/articles/microframeworks1-spring-boot

Mozilla 的 Rust 编程语言

Why you should learn the Rust programming language

Discover the history, key concepts, and tools for using Rust
https://www.ibm.com/developerworks/library/os-developers-know-rust/index.html


Rust 编程语言入门

您是开发人员吗?如果是,您需要知道 Rust。
https://www.ibm.com/developerworks/cn/opensource/os-developers-know-rust/index.html


Beginner’s Guide to Rust
Get to know Rust
A compiled alternative to C and C++
https://www.ibm.com/developerworks/opensource/library/os-know-rust/index.html

Rust 初学者指南
初识 Rust
一种替代 C 和 C++ 的编译语言
https://www.ibm.com/developerworks/cn/opensource/os-know-rust/index.html


Beginner’s Guide to Rust
Start coding with the Rust language
Put your Rust skills to use by building a simple Tic Tac Toe game
https://www.ibm.com/developerworks/opensource/library/os-using-rust/index.html

开始使用 Rust 语言进行编码
运用您的 Rust 技能构建一个简单的井字棋游戏
https://www.ibm.com/developerworks/cn/opensource/os-using-rust/index.html


Why you should learn the Rust programming language
Discover the history, key concepts, and tools for using Rust
https://www.ibm.com/developerworks/library/os-developers-know-rust/index.html


Why Rust is the Most Loved Language by Developers

https://medium.com/mozilla-tech/why-rust-is-the-most-loved-language-by-developers-666add782563


Mozilla bets its Rust language will make your internet safer

JavaScript, a smash hit among programmers, made the web powerful. Now Mozilla’s Rust could protect the web from hack attacks.

https://www.cnet.com/news/mozilla-designs-rust-language-for-safe-secure-internet/


What is Rust? Safe, fast, and easy software development

The Rust language’s unique approach results in better code with fewer compromises than C, C++, Go, and the other languages you probably use

https://www.infoworld.com/article/3218074/application-development/what-is-rust-safe-fast-and-easy-software-development.html


Rust in 2018: it’s way easier to use!

https://jvns.ca/blog/2018/01/13/rust-in-2018--way-easier-to-use/

I’ve been using Rust on and off since late 2013. 4 weeks ago, I picked up Rust again, and the language is SO MUCH EASIER than it was the last time I tried it (May 2016). I think that’s really exciting!


tweets of John Carmack

Quality, reliable software can be delivered in any language, but language choice has an impact. For me, C would be a middle-of-the-road choice; better than a dynamic language like javascript or python, but not as good as a more modern strongly static typed languages. However, the existence of far more analysis tools for C is not an insignificant advantage. If you really care about robustness, you are going to architect everything more like old Fortran, with no dynamic allocations at all, and the code is going to look very simple and straightforward.

So an interesting question: What are the aspects of C++ that are real wins for that style over C? Range checked arrays would be good. What else?

Rust would be the obvious things, and I don’t have any reason to doubt it would be good, but I haven’t implemented even a medium sized application in it. I have been checking on the progress for a while, and I like what I see, but I haven’t coded in it.

值得关注的GitHub用户及项目

project

Algorithms, 4th edition textbook code and libraries
https://github.com/kevin-wayne/algs4

JavaPerformanceTuning
https://github.com/ScottOaks/JavaPerformanceTuning

Examples for O’Reilly & Associates Java Performance Tuning: The Definitive Guide

Code for the book Grokking Algorithms
https://github.com/egonSchiele/grokking_algorithms

Grokking Algorithms

An illustrated guide for programmers and other curious people

Aditya Y. Bhargava

people

Zhiqiang Zhang
https://github.com/zhangzq

Yihui Xie
https://github.com/yihui

Ruan YiFeng
https://github.com/ruanyf

Hao Chen
https://github.com/haoel

bigsai
https://github.com/javasmall

xiaolai
https://github.com/xiaolai

Brian Clozel
https://github.com/bclozel

常用git命令

git init 会将当前目录直接转换为 Git 工作目录并在此处创建的 .git(隐藏)目录中创建存储库。

查看哪些文件处于什么状态
Show the working tree status
git status

git status 有助于跟踪已添加的更改、未添加的更改以及更改所在的分支。

查看尚未暂存的文件更新了哪些部分
git diff

直接将远程镜像克隆到本地仓库
git clone git@github.com:sikaozhe1997/Xin-Yue.git

查看远程仓库
git remote -v

从远程获取最新版本到本地
git fetch origin master

比较本地的仓库和远程参考的区别
git log -p master.. origin/master

switch branches: git checkout <branch>

switch branches with new branch created: git checkout -b <branch>

view branches: git branch

add files: git add file1 file2

commit: git commit -m “comments”

放弃修改

没有使用git add到暂存区

git checkout – filename

已经git add到暂存区

git reset HEAD filename

git add & git commit 之后

git reset commit_id

这个id是你想要回到的那个节点,可以通过 git log 查看

Configure your DVCS username for commits

global

git config --global user.name “FIRST_NAME LAST_NAME”

git config --global user.emailMY_NAME@example.com

repository-specific

git config user.name “FIRST_NAME LAST_NAME”

git config user.emailMY_NAME@example.com

git-stash

  • git stash

  • git stash list

  • git stash apply

  • git stash apply stash@{1} or git stash apply stash@"{1}"

  • git stash drop stash@{0} or git stash drop stash@"{0}"

  • git stash save "message"

  • git stash pop stash@{1} or git stash pop stash@"{1}"

pop vs apply

  • pop : Remove a single stashed state from the stash list and apply it on top of the current working tree state
  • apply: Like pop, but do not remove the state from the stash list.

checkout

  • git checkout --track origin/newsletter

Branch newsletter set up to track remote branch newsletter from origin.
Switched to a new branch ‘newsletter’

  • git checkout branch_name -- paths

clone

git clone --depth 1 url

Create a shallow clone with a history truncated to the latest commits.

rename branch

git branch -m main

del branch

delete local branch

  • git branch -d branch_name
  • git branch -D branch_name

The -d option stands for –delete, which would delete the local branch, only if you have already pushed and merged it with your remote branches.

The -D option stands for –delete –force, which deletes the branch regardless of its push and merge status, so be careful using this one!

delete remote branch

git push origin --delete branch_name

diff

  • git diff pom.xml
  • git diff branch1 branch2 – file

Unable to clone Git repository due to self signed certificate

git config --global http.sslVerify false

Git Pull Vs Git Fetch

Fetch

git fetch really only downloads new data from a remote repository - but it doesn’t integrate any of this new data into your working files. Fetch is great for getting a fresh view on all the things that happened in a remote repository.
Due to it’s “harmless” nature, you can rest assured: fetch will never manipulate, destroy, or screw up anything.

Pull

git pull, in contrast, is used with a different goal in mind: to update your current HEAD branch with the latest changes from the remote server. This means that pull not only downloads new data; it also directly integrates it into your current working copy files. This has a couple of consequences:

  • Since “git pull” tries to merge remote changes with your local ones, a so-called “merge conflict” can occur. Check out our in-depth tutorial on How to deal with merge conflicts for more information.
  • Like for many other actions, it’s highly recommended to start a “git pull” only with a clean working copy. This means that you should not have any uncommitted local changes before you pull.

keep a git branch in sync with master

1
2
3
4
5
6
7
git checkout master

git pull

git checkout feature_branch

git merge master
1
2
3
4
5
git checkout master

git merge feature_branch

git push origin master

git-reset

  • git reset --hard HEAD@{1} : if you’re feeling the moxy and haven’t done anything else

git reflog

git log

git log path

git-cherry-pick

git cherry-pick master

list conflicted files

git diff --name-only --diff-filter=U

force git pull to overwrite local files

1
2
git fetch --all
git reset --hard origin/branch_name

Force git stash to overwrite added files

git checkout stash -- .

github fork sync

1
2
git remote add upstream https://github.com/kevin-wayne/algs4.git
git pull upstream master

analyze metadata

ordered list of authors by commit count

git shortlog -sn

git shortlog -sn --no-merges

checkout any branch and overwrite local changes

git fetch --all

git reset --hard origin/abranch

git checkout $branch

Changing a commit message

change the most recent commit message

git commit --amend

change GitHub default branch from master to main

git config –global init.defaultBranch main