跳至主要内容

分布式系统领域的经典论文【转载】

作者:严林  编辑于 2015-05-08
链接:https://www.zhihu.com/question/30026369/answer/46476717
来源:知乎
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。

分布式系统在互联网时代,尤其是大数据时代到来之后,成为了每个程序员的必备技能之一。分布式系统从上个世纪80年代就开始有了不少出色的研究和论文,我在这里只列举最近15年范围以内我觉得有重大影响意义的15篇论文(15 within 15)。

1. The Google File System:
这是分布式文件系统领域划时代意义的论文,文中的多副本机制、控制流与数据流隔离和追加写模式等概念几乎成为了分布式文件系统领域的标准,其影响之深远通过其5000+的引用就可见一斑了,Apache Hadoop鼎鼎大名的HDFS就是GFS的模仿之作;

2. MapReduce: Simplified Data Processing on Large Clusters:
这篇也是Google的大作,通过Map和Reduce两个操作,大大简化了分布式计算的复杂度,使得任何需要的程序员都可以编写分布式计算程序,其中使用到的技术值得我们好好学习:简约而不简单!Hadoop也根据这篇论文做了一个开源的MapReduce;

3. Bigtable: A Distributed Storage System for Structured Data:
Google在NoSQL领域的分布式表格系统,LSM树的最好使用范例,广泛使用到了网页索引存储、YouTube数据管理等业务,Hadoop对应的开源系统叫HBase(我在前公司任职时也开发过一个相应的系统叫BladeCube,性能较HBase有数倍提升);

4. The Chubby lock service for loosely-coupled distributed systems:
Google的分布式锁服务,基于Paxos协议,这篇文章相比于前三篇可能知道的人就少了,但是其对应的开源系统zookeeper几乎是每个后端同学都接触过,其影响力其实不亚于前三篇;

5. Finding a Needle in Haystack: Facebook's Photo Storage:
facebook的在线图片存储系统,目前来看是对小文件存储的最好解决方案之一,facebook目前通过该系统存储了超过300PB的数据,一个师兄就在这个团队工作,听过很多有意思的事情(我在前公司的时候开发过一个类似的系统pallas,不仅支持副本,还支持Reed Solomon-LRC,性能也有较多优化);

6. Windows Azure Storage: a highly available cloud storage service with strong consistency:
windows azure的总体介绍文章,是一篇很好的描述云存储架构的论文,其中通过分层来同时保证可用性和一致性的思路在现实工作中也给了我很多启发;

7. GraphLab: A New Framework for Parallel Machine Learning:
CMU基于图计算的分布式机器学习框架,目前已经成立了专门的商业公司,在分布式机器学习上很有两把刷子,其单机版的GraphChi在百万维度的矩阵分解都只需要2~3分钟;

8. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing:
其实就是 Spark,目前这两年最流行的内存计算模式,通过RDD和lineage大大简化了分布式计算框架,通常几行scala代码就可以搞定原来上千行MapReduce代码才能搞定的问题,大有取代MapReduce的趋势;

9. Scaling Distributed Machine Learning with the Parameter Server:
百度少帅李沐大作,目前大规模分布式学习各家公司主要都是使用ps,ps具备良好的可扩展性,使得大数据时代的大规模分布式学习成为可能,包括Google的深度学习模型也是通过ps训练实现,是目前最流行的分布式学习框架,豆瓣的开源系统paracell也是ps的一个实现;

10. Dremel: Interactive Analysis of Web-Scale Datasets:
Google的大规模(近)实时数据分析系统,号称可以在3秒相应1PB数据的分析请求,内部使用到了查询树来优化分析速度,其开源实现为Drill,在工业界对实时数据分析也是比价有影响力;

11. Pregel: a system for large-scale graph processing:
Google的大规模图计算系统,相当长一段时间是Google PageRank的主要计算系统,对开源的影响也很大(包括GraphLab和GraphChi);

12. Spanner: Google's Globally-Distributed Database:
这是第一个全球意义上的分布式数据库,Google的出品。其中介绍了很多一致性方面的设计考虑,简单起见,还采用了GPS和原子钟确保时间最大误差在20ns以内,保证了事务的时间序,同样在分布式系统方面具有很强的借鉴意义;

13. Dynamo: Amazon’s Highly Available Key-value Store:
Amazon的分布式NoSQL数据库,意义相当于BigTable对于Google,于BigTable不同的是,Dynamo保证CAP中的AP,C通过vector clock做弱保证,对应的开源系统为Cassandra;

14. S4: Distributed Stream Computing Platform:
Yahoo出品的流式计算系统,目前最流行的两大流式计算系统之一(另一个是storm),Yahoo的主要广告计算平台;

15. Storm @Twitter:
这个系统不多说,开启了流式计算的新纪元,几乎是所有公司流式计算的首选,绝对值得关注;

评论里边 提到的两篇论文也挺不错的,一并补充在这里。
1. Large-scale cluster management at Google with Borg
2. F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business
 

Popular posts from 产品随想的博客

巴菲特致股东信-1980年

 笔记: 会计中对于下属股权公司的记账方式有3种: 持股50%以上,全部并入 持续20%--50%,则按持股比例并入 持股20%以下,则以实际收到的利润返还,计入报表 这种会计方式,会导致伯克希尔旗下,不少的企业,未能暴露实际的收益情况 对伯克希尔而言,对盈余的认定并非取决于持股比例是100%,50%,20%,5%或是1%,盈余的真正价值在于其将来再投资所能产生的效益 我们宁愿将所赚的盈余继续交由不受我们控制的人好好发挥,也不希望转由我们自己来浪费 高通货膨胀等于是对投入的资本额外课了一次税 翻译: https://xueqiu.com/6217262310/131837878 https://archive.ph/XMX5n  原文: Buffett’s Letters To Berkshire Shareholders 1980 巴菲特致股东的信 1980 年 Operating earnings improved to $41.9 million in 1980 from $36.0 million in 1979, but return on beginning equity capital (with securities valued at cost) fell to 17.8% from 18.6%. We believe the latter yardstick to be the most appropriate measure of single-year managerial economic performance. Informed use of that yardstick, however, requires an understanding of many factors, including accounting policies, historical ca...

黑客讲述渗透Hacking Team全过程

原文地址在 Freebuf ,后来已经被删除 Wayback Machine 备份 近期,黑客 Phineas Fisher在pastebin.com上讲述了入侵Hacking Team的过程,以下为其讲述的原文情况,文中附带有相关文档、工具及网站的链接,请在安全环境下进行打开,并合理合法使用。作者部分思想较为激进,也请以辩证的观点看待之。 1、序言 在这里,可能你会注意到相比于前面的一个版本,这个版本的内容及语言有了一些变化,因为这将是最后一个版本了 [1] 。对于黑客技术,英语世界中已经有了许多书籍,讲座,指南以及关于黑客攻击的知识。在那个世界,有许多黑客比我优秀,但他们埋没了他们的天赋,而为所谓的“防护”服务商(如Hacking Team之流的),情报机构服务工作。黑客文化作为一项非主流文化诞生于美国,但它现在只保留了它本质的魅力,其他均被同化了。从黑客的本质出发,至少他们可以穿着一件T恤,把头发染成蓝色,用自己的黑客的名字,随意 洒脱 地做着自己喜欢的事件,而当他们为别人(前文所指的 Hacking Team及情报机构 )工作的时候,会感觉自己像个反抗者。 如果按照传统的方式,你不得不潜入办公室偷偷拿到文件[2],或者你不得不持枪抢劫银行。但现在你仅仅需要一台笔记本,躺在床上动动手指便可做得这一切[3][4]。像CNT在入侵伽玛集团(Gamma Group)之后说的,“让我们以一种新的斗争方式向前迈进吧”[5]。 [ 1 ] http: / /pastebin.com/raw .php?i=cRYvK4jb [ 2 ] https: / /en.wikipedia.org/wiki /Citizens%27_Commission_to_Investigate_the_FBI [3] http:/ /www.aljazeera.com/news /2015/ 09/algerian-hacker-hero-hoodlum- 15092108 3914167 .html [ 4 ] https: / /securelist.com/files /2015/ 02 /Carbanak_APT_eng.pdf  [ 5 ] http: / /madrid.cnt.es/noticia /consideraci...

产品爱好者周刊 第26期:PRISM, XKeyscore, Trust No One

  Products Gitea - Git with a cup of tea   https://gitea.io/en-us/ A painless self-hosted Git service. 自建Git服务,避免GitHub隐私侵犯 https://github.com/objective-see/LuLu LuLu is the free macOS firewall 监视Mac的出站流量,且阻断 OverSight   https://github.com/objective-see/OverSight OverSight monitors a mac's mic and webcam, alerting the user when the internal mic is activated, or whenever a process accesses the webcam. 监视是否有应用调用Mac的麦克风、摄像头 Mozilla Hubs   https://github.com/mozilla/hubs The client-side code for Mozilla Hubs, an online 3D collaboration platform that works for desktop, mobile, and VR platforms. 开源的多人虚拟空间,Mozilla打造,企业级VR诉求 数字移民   https://shuziyimin.org 关于内容源、工具的推荐,适合刚接入国际的新人 SimpleLogin   https://simplelogin.io/ 匿名邮箱工具,转发用,Michael Bazzell推荐 Telegram 群组、频道、机器人 - 汇总分享   https://congcong0806.github.io/2018/04/24/Telegram/#机器人-bot https://archive.ph/iJMBj 献给那些将来到Telegram的朋友 Design Patrick Wardle   https://www.instagram.com/patrickwardle/?hl=en 他的IG,摄影也精彩,审美...

360T7 刷机步骤及固件

https://cmi.hanwckf.top/p/360t7-firmware/   360T7的固件支持由immortalwrt-mt798x项目提供支持,请参考: https://cmi.hanwckf.top/p/immortalwrt-mt798x https://github.com/hanwckf/immortalwrt-mt798x 刷机步骤 参考 此处 的办法开启原厂固件的UART和telnet功能 在以下链接下载360T7测试固件(纯净版,无任何插件) https://wwd.lanzout.com/b0bt9idwd 密码:ezex (此固件已过时,请选择其它更新的固件) 接下来将刷入修改版uboot。修改版uboot的优点有: 固件分区可达108MB,原厂uboot只能使用36M 自带一个简单的webui恢复页面 到以下仓库的Release页面下载uboot,目前暂时仅支持360T7,后续将支持更多mt798x路由器。 推荐使用 mt7981_360t7-fip-fixed-parts.bin , fixed-parts 代表uboot分区表在编译期间固定,不会随着uboot环境变量变化。 https://github.com/hanwckf/bl-mt798x/releases/latest 将 mt7981_360t7-fip-fixed-parts.bin 通过HFS等方式上传到路由器,使用以下命令刷入uboot mtd write mt7981_360t7-fip-fixed-parts.bin fip 确认刷入完毕后,拔掉路由器电源。然后将电脑的IP地址设置为固定的 192.168.1.2 ,接着按住路由器的RESET按钮后通电开机,等待8s后用浏览器进入 192.168.1.1 在uboot恢复页面选择要刷入的固件。immortalwrt-mt798x目前编译两个版本的360T7固件。 建议修改版uboot直接使用 immortalwrt-mediatek-mt7981-mt7981-360-t7-108M-squashfs-factory.bin ,两种固件区别如下: mt7981-360-t7-108M 为108M固件分区,原厂uboot不可启动,需要修改版u...

Albert Einstein Said Death Is Not An End Can Prompt You To Find The Meaning and Purpose Of Your Life

原文Link: https://quotationize.com/albert-einstein-said-death-not-end/ 产品随想注: 爱因斯坦对于死亡的观点,深深影响了乔布斯  ---------------- Albert Einstein said death is not an end if we can live on in our children and the younger generation is a line taken from the letter which he wrote to the widow of physicist Heike Kamerlingh Onnes in 1926. Besides death, he also talked about afterlife, immortality and soul. If you have read through my authentic collection of Albert Einstein thoughts on God and religion , you would know that he rejected the formal, dogmatic religion. Einstein did not believe in immortality of the individual. According to him, there is no such thing as, punishment for misdeeds or rewards for good behavior in any afterlife. For him, the so-called Theosophy and Spiritualism, was no more than a symptom of weakness and confusion. As Einstein explained that since our inner experiences consist of reproductions, and combinations of sensory impressions, the concept of a soul with...

产品随想 | 周刊 第85期:e-Residency与数字游民

  David Shambaugh   https://www.google.com/search?q=David+Shambaugh 中国问题研究专家,著作极多 郭玉闪   https://zh.wikipedia.org/wiki/郭玉闪?useskin=vector 中国公共知识分子 我只想好好观影   github.com/BetterWorld-Liuser/autoMovies 刘煜辉:中国资本市场灵魂出窍 最有活力的公司几乎不在A股   https://finance.sina.com.cn/stock/marketresearch/2017-06-23/doc-ifyhmtek7705574.shtml 回看17年的专家讲话,还是挺有水平的,挺多都认可 纽约文化沙龙   https://www.youtube.com/@user-cu2hl5tf6y/videos 视频质量出奇的高,推荐 透视中国政治by吴国光、程晓农 备忘下,貌似评价挺好的一本书 CAPI China Chair Wu Guoguang (吴国光 / 吳國光)   https://www.youtube.com/playlist?list=PLIt1szHhnm_Hso3jGUbfGpnEAbsPOuEVV 因为热爱中国,我们越要看懂中国 AI Canon   https://a16z.com/2023/05/25/ai-canon/ in this post, we’re sharing a curated list of resources we’ve relied on to get smarter about modern AI. We call it the “AI Canon” because these papers, blog posts, courses, and guides have had an outsized impact on the field over the past several years. 希望中国的投資機構,也能有更多的分享與輸出,提升整個社會的認知 Cantonese Font 粵語字體   https://visual-fonts.com/zh/...

SS机场常用服务器线路微普及

原文link:https://www.duyaoss.com/archives/57/   为何写这么个帖子? 更新时间:2019-11-29 由于机场用户增多,很多新用户压根不懂节点上面的名字代表什么,也不知道什么服务器比较适合自己,不懂什么是原生,等等。 所以开一个小帖,稍微介绍一下比较常见的服务器, 专业知识有限,所以只是给小白们介绍一下,其实我也很白,各位大佬见笑了。 在这里尤其感谢 Sukka 苏卡卡大佬和喵酱指导,以及 Nexitally 佩奇提供的资料介绍,否则我真不知道从哪儿开始动笔。后面地区内容都是佩奇帮忙码出来的。时间有限,慢慢再继续填充和修整 本文仅仅是抛砖引玉写一些机场主们告知我的 ISP、IDC 的体验,仅供参考。网络环境每天都在变化,今天飞快的服务器明天有可能龟速,有写的不对或者过时的地方还望大家指正。所以本文也算是一些机场主们把曾经踩过的坑分享给大家吧。(本来是想给小白写服务器介绍的,佩奇大佬写着写着就专业惯性的转到了商家哈哈哈,这是一个悲伤的故事) 测速图 Telegram 频道: https://t.me/DuyaoSS 主用链接: DuyaoSS - 毒药机场简介博客 常见名词: IPLC: "International Private Leased Circuit"的缩写,即“国际专线”。不过大部分机场通常看到的iplc,都只是阿里的经典网络,跨数据中心内网互通,阿里内网,并不是严格意义的iplc专线;当然也有其他渠道的,或真iplc,不过比较少。阿里云的内网互通底层原理是通过采购多个点对点的iplc专线,来连接各个数据中心,从而把各个数据中心纳入到自己的一套内网里面来。这样做有两个好处,其一是iplc链路上的带宽独享,完全不受公网波动影响,其二是过境的时候不需要经过GFW,确保了数据安全且不受外界各种因素干扰。但是需要注意一下阿里云的iplc也是有带宽上限的,如果过多的人同时挤到同一条专线上,峰值带宽超过专线的上限的话也同样会造成网络不稳定。其他渠道购买到的iplc价格很高,阿里云内网这种性价比超高这种好东西且用且珍惜。 IEPL国际以太网专线(International Ethernet Private Line,简称IEPL),构建于MSTP设备平台上...

Interview at the All Things Digital D5 Conference, Steve and Bill Gates spoke with journalists Kara Swisher and Walt Mossberg onstage in May 2007.

Kara Swisher: The first question I was interested in asking is what you think each has contributed to the computer and technology industry— starting with you, Steve, for Bill, and vice versa. Steve Jobs: Well, Bill built the first software company in the industry. And I think he built the first software company before anybody really in our industry knew what a software company was, except for these guys. And that was huge. That was really huge. And the business model that they ended up pursuing turned out to be the one that worked really well for the industry. I think the biggest thing was, Bill was really focused on software before almost anybody else had a clue that it was really the software that— KS: Was important? SJ: That’s what I see. I mean, a lot of other things you could say, but that’s the high-order bit. And I think building a company’s really hard, and it requires your greatest persuasive abilities to hire the best ...

A Sister’s Eulogy for Steve Jobs

I grew up as an only child, with a single mother. Because we were poor and because I knew my father had emigrated from Syria, I imagined he looked like Omar Sharif. I hoped he would be rich and kind and would come into our lives (and our not yet furnished apartment) and help us. Later, after I’d met my father, I tried to believe he’d changed his number and left no forwarding address because he was an idealistic revolutionary, plotting a new world for the Arab people. Even as a feminist, my whole life I’d been waiting for a man to love, who could love me. For decades, I’d thought that man would be my father. When I was 25, I met that man and he was my brother. By then, I lived in New York, where I was trying to write my first novel. I had a job at a small magazine in an office the size of a closet, with three other aspiring writers. When one day a lawyer called me — me, the middle-class girl from California who hassled the boss to buy us health insurance — and said his cl...

Interview with Steve Jobs, WGBH, 1990

Interviewer: what is it about this machine? Why is this machine so interesting? Why has it been so influential? Jobs: Ah ahm, I'll give you my point of view on it. I remember reading a magazine article a long time ago ah when I was ah twelve years ago maybe, in I think it was Scientific American . I'm not sure. And the article ahm proposed to measure the efficiency of locomotion for ah lots of species on planet earth to see which species was the most efficient at getting from point A to point B. Ah and they measured the kilocalories that each one expended. So ah they ranked them all and I remember that ahm...ah the Condor, Condor was the most efficient at [CLEARS THROAT] getting from point A to point B. And humankind, the crown of creation came in with a rather unimpressive showing about a third of the way down...