laiahu

Using Hadoop Distributed Cache

Categories: hadoop, hadoop cache

Hadoop has a distributed cache mechanism that makes files available locally to the Map/Reduce jobs that need them. This post expands a bit on the information provided by the Javadoc of DistributedCache.

Use Case

Let's look at the use case in a bit more detail so that the code snippets are easier to follow.
We have a key-value file that we need to use in our map tasks. For simplicity, let's say we need to replace every keyword we encounter during parsing with some other value.

So what we need is:

A key-value file (let's use a Properties file)
The Mapper code that uses the file
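Before bringing Hadoop into the picture, the replacement itself can be sketched with nothing but the JDK. The class name `KeywordReplacer`, the space-separated tokenization and the sample key-value pairs are all assumptions made up for illustration; the post does not say how keywords are delimited.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not part of Hadoop: replaces known keywords in a line of text.
final class KeywordReplacer {

    // Replace every space-separated token that appears as a key in `mapping`
    // with its value; tokens without a mapping pass through unchanged.
    static String replaceKeywords(String line, Properties mapping) {
        String[] tokens = line.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(mapping.getProperty(tokens[i], tokens[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of keyvalues.properties (made-up sample data).
        Properties mapping = new Properties();
        mapping.load(new StringReader("nyc=New York\nsf=San Francisco\n"));

        // prints: flights from New York to San Francisco
        System.out.println(replaceKeywords("flights from nyc to sf", mapping));
    }
}
```

A Properties file fits this use case because it already is a flat key=value format with a loader in the standard library.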

Step 1

Place the key-values file on the HDFS

hadoop fs -put ./keyvalues.properties cache/keyvalues.properties

The destination path is relative to the user's home folder on HDFS.
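To make "relative to the home folder" concrete, here is a small sketch of how such a path resolves. It assumes the usual default layout in which a user's HDFS home is /user/<userName>; the user name "alice" and the class name `HdfsHomeResolver` are hypothetical, and a cluster can be configured differently.

```java
import java.net.URI;

// Hypothetical helper: shows where a relative HDFS path ends up under the
// assumed default home directory layout /user/<userName>/.
final class HdfsHomeResolver {

    static String resolve(String userName, String path) {
        URI home = URI.create("/user/" + userName + "/");
        // Relative paths are appended to the home folder; absolute paths are kept as-is.
        return home.resolve(path).getPath();
    }

    public static void main(String[] args) {
        // prints: /user/alice/cache/keyvalues.properties
        System.out.println(resolve("alice", "cache/keyvalues.properties"));
    }
}
```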

Step 2

Write the Mapper code that uses it

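    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Properties;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Key-value pairs loaded from the cached properties file
        private final Properties cache = new Properties();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            super.setup(context);

            // Local paths of the files that were copied to this node
            Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

            if (localCacheFiles != null) {
                // expecting only a single file here
                for (Path localCacheFile : localCacheFiles) {
                    // close the reader once the properties are loaded
                    try (FileReader reader = new FileReader(localCacheFile.toString())) {
                        cache.load(reader);
                    }
                }
            } else {
                // do your error handling here
            }
        }

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // use the cache here:
            // if value contains some keyword, look it up with cache.getProperty(keyword)
            // and do some action or replace it with something else
        }
    }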

The Mapper code is simple enough. During the setup phase, we read the file and populate the Properties object. Inside map(), we use the cache to look up certain keys and replace them if they are present.
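The same load-once, look-up-many pattern can be tried outside a cluster. The sketch below uses only the JDK; `LocalCacheLookup`, the temporary file and its single entry are made up for illustration and stand in for the localized cache file that Hadoop would hand to the task.

```java
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical stand-in for the setup()/map() pair, without any Hadoop classes.
final class LocalCacheLookup {

    // Mirrors setup(): read a local properties file into memory once.
    static Properties load(String localPath) throws IOException {
        Properties cache = new Properties();
        try (FileReader reader = new FileReader(localPath)) {
            cache.load(reader);
        }
        return cache;
    }

    // Mirrors the lookup inside map(): return the replacement,
    // or the keyword itself when the cache has no entry for it.
    static String lookup(Properties cache, String keyword) {
        return cache.getProperty(keyword, keyword);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file plays the role of the localized cache file.
        Path file = Files.createTempFile("keyvalues", ".properties");
        Files.write(file, "colour=color\n".getBytes(StandardCharsets.UTF_8));

        Properties cache = load(file.toString());
        System.out.println(lookup(cache, "colour"));   // key present in the file
        System.out.println(lookup(cache, "flavour"));  // key absent, returned unchanged

        Files.delete(file);
    }
}
```

Loading in setup() rather than in map() matters because map() runs once per input record, while the file only needs to be parsed once per task.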

Step 3

Add the properties file to your driver code

    JobConf jobConf = new JobConf();

    // set job properties

    // set the cache file; the part after '#' is the link name used on the task nodes
    DistributedCache.addCacheFile(new URI("cache/keyvalues.properties#keyvalues.properties"), jobConf);
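The URI passed to addCacheFile above carries two pieces of information: the location of the file and, after the '#', a link name. A quick way to see how such a string splits is to parse it with java.net.URI. The helper `CacheUriParts` is hypothetical and only demonstrates standard URI parsing, not what Hadoop does internally.

```java
import java.net.URI;

// Hypothetical helper that splits a cache URI into its two parts.
final class CacheUriParts {

    // The part before '#': where the file lives.
    static String filePath(String cacheUri) {
        return URI.create(cacheUri).getPath();
    }

    // The part after '#': the link name, or null if the URI has no fragment.
    static String linkName(String cacheUri) {
        return URI.create(cacheUri).getFragment();
    }

    public static void main(String[] args) {
        String uri = "cache/keyvalues.properties#keyvalues.properties";
        System.out.println(filePath(uri));  // prints: cache/keyvalues.properties
        System.out.println(linkName(uri));  // prints: keyvalues.properties
    }
}
```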

Some reference material:

DistributedCache

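DistributedCache distributes application-specific, large, read-only files efficiently.

DistributedCache is a facility provided by the Map/Reduce framework to cache the files an application needs (text files, archives, jars and so on).

Applications specify the files to be cached via URLs (hdfs://) in the JobConf. The DistributedCache assumes that the files specified via hdfs:// URLs are already present on the FileSystem.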

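The Map/Reduce framework copies the necessary files to a slave node before any task of the job is executed on that node. Its efficiency stems from the fact that the files are copied only once per job, and from its ability to cache archives, which are un-archived on the slave nodes.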

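DistributedCache tracks the modification timestamps of the cached files. The cached files must not be modified by the application or by an external program while the job is executing.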
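The idea behind timestamp tracking can be illustrated on a local file: remember the modification time when the job is set up, then compare it later. This is only a sketch of the principle using java.nio; the class `CacheTimestampCheck` is hypothetical and is not how Hadoop implements the check.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical illustration of detecting a cache file that changed after it was registered.
final class CacheTimestampCheck {

    // Remember the modification time at registration.
    static long recordTimestamp(Path file) throws IOException {
        return Files.getLastModifiedTime(file).toMillis();
    }

    // True if the file still has the modification time that was recorded.
    static boolean isUnchanged(Path file, long recordedMillis) throws IOException {
        return Files.getLastModifiedTime(file).toMillis() == recordedMillis;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("keyvalues", ".properties");
        long recorded = recordTimestamp(file);
        System.out.println(isUnchanged(file, recorded));  // nothing has touched the file yet

        // Simulate an external program modifying the file while the job runs.
        Files.setLastModifiedTime(file, FileTime.fromMillis(recorded + 5000));
        System.out.println(isUnchanged(file, recorded));  // the change is detected

        Files.delete(file);
    }
}
```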

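DistributedCache can distribute simple read-only data or text files as well as more complex types such as archives and jar files. Archives (zip, tar, tgz and tar.gz files) are un-archived on the slave nodes. Execute permissions can be set on these files.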
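For readers unfamiliar with what "un-archived" means in practice, the following sketch unpacks a zip stream into a directory with java.util.zip. It is a plain JDK illustration of the step, not Hadoop's own unpacking code; the class `ArchiveUnpacker` and the sample entry name are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration of un-archiving a zip file on a node.
final class ArchiveUnpacker {

    // Extract every entry of the zip stream below targetDir; returns the number of files written.
    static int unzip(InputStream zipData, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        int files = 0;
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside the target directory.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    files++;
                }
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory so the example needs no external file.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("conf/keyvalues.properties"));
            zip.write("colour=color\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }

        Path target = Files.createTempDirectory("unpacked");
        int count = unzip(new ByteArrayInputStream(buffer.toByteArray()), target);
        System.out.println(count + " file(s) extracted to " + target);
    }
}
```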

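Users can distribute files by setting the property mapred.cache.{files|archives}. To distribute more than one file, separate the paths with commas. The property can also be set through the API: DistributedCache.addCacheFile(URI, conf) / DistributedCache.addCacheArchive(URI, conf) and DistributedCache.setCacheFiles(URIs, conf) / DistributedCache.setCacheArchives(URIs, conf), where each URI has the form hdfs://host:port/absolute-path#link-name. In Streaming programs, files can be distributed with the command-line options -cacheFile/-cacheArchive.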
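A short sketch of the comma-separated form may help here. It builds and parses such a value with plain JDK classes; `CacheFileList` is a hypothetical helper, and the host name, port 8020 and file names are made-up sample values.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for a comma-separated list of cache file URIs.
final class CacheFileList {

    // Join URIs into one comma-separated property value.
    static String toPropertyValue(List<URI> uris) {
        List<String> parts = new ArrayList<>();
        for (URI uri : uris) {
            parts.add(uri.toString());
        }
        return String.join(",", parts);
    }

    // Split a comma-separated property value back into URIs, skipping empty items.
    static List<URI> fromPropertyValue(String value) {
        List<URI> uris = new ArrayList<>();
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                uris.add(URI.create(trimmed));
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        List<URI> uris = Arrays.asList(
                URI.create("hdfs://namenode:8020/cache/a.properties#a.properties"),
                URI.create("hdfs://namenode:8020/cache/b.properties#b.properties"));
        String value = toPropertyValue(uris);
        System.out.println(value);
        System.out.println(fromPropertyValue(value).size());  // prints: 2
    }
}
```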

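Users can have the DistributedCache create symlinks to the cached files in the current working directory of the task by calling DistributedCache.createSymlink(Configuration), or by setting the configuration property mapred.create.symlink to yes. The DistributedCache uses the fragment of the URI as the name of the symlink. For example, for the URI hdfs://namenode:port/lib.so.1#lib.so, the task's current working directory will contain a symlink named lib.so that points to the file lib.so.1 in the distributed cache.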
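The lib.so example can be reproduced locally with java.nio to see what ends up in the working directory. This is a sketch under assumptions: two temporary directories stand in for the node's cache directory and the task's working directory, port 8020 replaces the unspecified port, `CacheSymlinkDemo` is a hypothetical class, and the file system must support symbolic links.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of naming a symlink after the URI fragment.
final class CacheSymlinkDemo {

    // Create <workDir>/<fragment> as a symlink pointing at the localized file.
    static Path link(URI cacheUri, Path localizedFile, Path workDir) throws IOException {
        String linkName = cacheUri.getFragment();
        if (linkName == null) {
            throw new IllegalArgumentException("URI has no #link-name fragment: " + cacheUri);
        }
        return Files.createSymbolicLink(workDir.resolve(linkName), localizedFile);
    }

    public static void main(String[] args) throws IOException {
        Path cacheDir = Files.createTempDirectory("cache");    // stands in for the node's cache directory
        Path workDir = Files.createTempDirectory("task-cwd");  // stands in for the task's working directory
        Path localized = Files.createFile(cacheDir.resolve("lib.so.1"));

        URI uri = URI.create("hdfs://namenode:8020/lib.so.1#lib.so");
        Path symlink = link(uri, localized, workDir);

        // prints: lib.so -> lib.so.1
        System.out.println(symlink.getFileName() + " -> " + Files.readSymbolicLink(symlink).getFileName());
    }
}
```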

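DistributedCache can also be used as a rudimentary software distribution mechanism in map/reduce tasks. It can distribute both jars and native libraries. The DistributedCache.addArchiveToClassPath(Path, Configuration) and DistributedCache.addFileToClassPath(Path, Configuration) APIs can be used to cache files and jars and add them to the classpath of the child JVM. The same effect can be achieved by setting the configuration properties mapred.job.classpath.{files|archives}. Cached files can likewise be used to distribute and load native libraries.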

2012-06-20 10:33
