I am trying to set up a simple EMR job to perform a word count over a large number of text files stored in s3://__mybucket__/input/. I cannot correctly add the first of the two required streaming steps (the first maps the input with wordSplitter.py and reduces to temporary storage with IdentityReducer; the second maps the contents of this secondary storage with /bin/wc/ and again reduces with IdentityReducer).
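For reference, the two steps I am trying to add would look roughly like this with the AWS CLI (a sketch only: the cluster ID, the step names, and the intermediate prefix s3://__mybucket__/intermediate/ are placeholders of mine):

# step 1: split words with wordSplitter.py, pass records through IdentityReducer
# step 2: count with /bin/wc, pass through IdentityReducer again
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=STREAMING,Name=WordSplit,ActionOnFailure=CONTINUE,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,org.apache.hadoop.mapred.lib.IdentityReducer,-input,s3://__mybucket__/input/,-output,s3://__mybucket__/intermediate/] \
  Type=STREAMING,Name=WordCount,ActionOnFailure=CONTINUE,Args=[-mapper,/bin/wc,-reducer,org.apache.hadoop.mapred.lib.IdentityReducer,-input,s3://__mybucket__/intermediate/,-output,s3://__mybucket__/output/]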
Here is the (failing) description of the first step:
Status: FAILED
Reason: S3 Service Error.
Log File: s3://aws-logs-209733341386-us-east-1/elasticmapreduce/j-2XC5AT2ZP48FJ/steps/s-1SML7U7CXRDT5/stderr.gz
Details: Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 7799087FCAE73457), S3 Extended Request ID: nQYTtW93TXvi1G8U4LLj73V1xyruzre+uSt4KN1zwuIQpwDwa+J8IujOeQMpV5vRHmbuKZLasgs=
JAR location: command-runner.jar
Main class: None
Arguments: hadoop-streaming -files s3://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3://__mybucket__/input/ -output s3://__mybucket__/output/
Action on failure: Continue
Here is the command sent to the Hadoop cluster:
JAR location: command-runner.jar
Main class: None
Arguments: hadoop-streaming -mapper s3a://elasticmapreduce/samples/wordcount/wordSplitter.py -reducer aggregate -input s3a://__my_bucket__/input/ -output s3a://__my_bucket__/output/
Best answer: I think the solution here may be very simple.
Instead of s3://, use s3a:// as the scheme for the jobs that access your bucket.
See here: the s3:// scheme is deprecated, and it requires the bucket in question to be dedicated to your Hadoop data. Quoting from the documentation linked above:
This filesystem requires you to dedicate a bucket for the filesystem –
you should not use an existing bucket containing files, or write other
files to the same bucket. The files stored by this filesystem can be
larger than 5GB, but they are not interoperable with other S3 tools.
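With that change, the arguments for your (previously failing) first step would look something like this (a sketch based on your original step; the bucket name is still your placeholder):

hadoop-streaming -files s3a://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3a://__mybucket__/input/ -output s3a://__mybucket__/output/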