摘要:一位开发人员每天跑相同的任务,但是他发现某一天同样的任务处理大约差不多量的数据(都是大约4亿条,hive count的结果),处理时间却相差3倍之多,一个不到一个小时,一个运行了三个小时,记录一下分析步骤。
一位开发人员每天跑相同的任务,但是他发现某一天同样的任务处理大约差不多量的数据(都是大约4亿条,hive count的结果),处理时间却相差3倍之多,一个不到一个小时,一个运行了三个小时,记录一下分析步骤。
1. 首先用explain看一下hive语句执行步骤再去找对应的job任务。 可以看出此sql语句分4个任务去跑。
Stage-1:执行join操作
Stage-2:根据Stage-1 执行count操作
Stage-0:数据移动到新表的操作。
[sql] view plain copy
1. STAGE DEPENDENCIES:
2. Stage-1 is a root stage
3. Stage-2 depends on stages: Stage-1
4. Stage-0 depends on stages: Stage-2
5. Stage-3 depends on stages: Stage-0
6.
7. STAGE PLANS:
8. Stage: Stage-1
9. Map Reduce
10. Alias -> Map Operator Tree:
11. m
12. TableScan
13. alias: m
14. Reduce Output Operator
15. key expressions:
16. expr: rtb_label
17. type: string
18. sort order: +
19. Map-reduce partition columns:
20. expr: rtb_label
21. type: string
22. tag: 0
23. value expressions:
24. expr: label_id
25. type: int
26. expr: extract
27. type: string
28. t:rtb_requests
29. TableScan
30. alias: rtb_requests
31. Filter Operator
32. predicate:
33. expr: (tid <> '')
34. type: boolean
35. Lateral View Forward
36. Select Operator
37. expressions:
38. expr: tid
39. type: string
40. expr: opxpid
41. type: string
42. expr: request_date_i
43. type: string
44. outputColumnNames: tid, opxpid, request_date_i
45. Lateral View Join Operator
46. outputColumnNames: _col12, _col16, _col40, _col42
47. Select Operator
48. expressions:
49. expr: _col12
50. type: string
51. expr: _col16
52. type: string
53. expr: _col42
54. type: string
55. outputColumnNames: _col0, _col1, _col2
56. Reduce Output Operator
57. key expressions:
58. expr: _col2
59. type: string
60. sort order: +
61. Map-reduce partition columns:
62. expr: _col2
63. type: string
64. tag: 1
65. value expressions:
66. expr: _col0
67. type: string
68. expr: _col1
69. type: string
70. Select Operator
71. expressions:
72. expr: array(detected_content_labels[0],CASE WHEN ((user_agent is null or (user_agent = ''))) THEN ('device_Undefined') WHEN (((user_agent rlike 'iP(a|ro)d') or (user_agent rlike 'Tablet'))) THEN ('device_Tablet') WHEN ((((user_agent rlike 'Kindle') or (user_agent rlike 'Mac.OS')) and (user_agent rlike 'Silk'))) THEN ('device_Tablet') WHEN ((((user_agent rlike 'HTC(.Flyer|\_Flyer)|ViewPad7|Dell.Streak.7|Next2|nook') and (user_agent rlike 'RUTEM/')) and (user_agent rlike 'Silk'))) THEN ('device_Tablet') WHEN ((((user_agent rlike 'Opera') and (user_agent rlike 'Windows.NT.5')) and (user_agent rlike 'HTC|Xda|Mini|Vario|SAMSUNG-GT-i8000|SAMSUNG-SGH-i9'))) THEN ('device_Mobile') WHEN ((((user_agent rlike 'Windows') and (not (user_agent rlike 'Phone'))) or (user_agent rlike 'Win(9|.9|NT)'))) THEN ('device_Computer') WHEN (((user_agent rlike 'Macintosh|PowerPC') and (not (user_agent rlike 'Silk')))) THEN ('device_Computer') WHEN ((user_agent rlike 'Android|Phone|Mobile')) THEN ('device_Mobile') WHEN (((user_agent rlike 'Linux') and (user_agent rlike 'X11'))) THEN ('device_Computer') WHEN ((user_agent rlike 'Solaris|SunOS|OpenBSD|FreeBSD')) THEN ('device_Computer') ELSE ('device_Other') END,CASE WHEN (((user_agent = '') or user_agent is null)) THEN ('browser_Undefined') WHEN ((user_agent rlike 'Chrome/[0-9]{1,2}')) THEN ('browser_Chrome') WHEN ((user_agent rlike 'MSIE [0-9]{1,2}.[0-9]+')) THEN ('browser_Internet Explorer') WHEN ((user_agent rlike 'Firefox/[0-9]{1,2}')) THEN ('browser_Firefox') WHEN ((user_agent rlike 'Opera Mini/[0-9]{1,2}')) THEN ('browser_Opera Mini') WHEN ((user_agent rlike 'Opera/[0-9]{1,2}')) THEN ('browser_Opera') WHEN (((user_agent rlike 'Version/[0-9]{1,2}') and (user_agent rlike 'Safari/[0-9]+'))) THEN ('browser_Safari') ELSE ('browser_Other') END,CASE WHEN ((user_agent is null or (user_agent = ''))) THEN ('os_Undefined') WHEN (((user_agent rlike 'iPad') or (user_agent rlike 'iPhone'))) THEN ('os_iOS') WHEN ((user_agent rlike 'Android')) THEN ('os_Android') WHEN (((user_agent rlike 'Windows') or (user_agent rlike 'Win(9|.9|NT)'))) THEN ('os_Windows') WHEN (((user_agent rlike 'Macintosh|PowerPC') and (user_agent rlike 'Mac OS'))) THEN ('os_Mac OS') WHEN (((user_agent rlike 'Linux') and (user_agent rlike 'X11'))) THEN ('os_Linux') WHEN ((user_agent rlike 'Solaris|SunOS|OpenBSD|FreeBSD')) THEN ('os_Other') ELSE ('os_Other') END,CASE WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0000') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0059'))) THEN ('00_01') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0100') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0159'))) THEN ('01_02') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0200') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0259'))) THEN ('02_03') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0300') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0359'))) THEN ('03_04') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0400') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0459'))) THEN ('04_05') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0500') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0559'))) THEN ('05_06') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0600') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0659'))) THEN ('06_07') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0700') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0759'))) THEN ('07_08') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0800') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0859'))) THEN ('08_09') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0900') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0959'))) THEN ('09_10') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1000') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1059'))) THEN ('10_11') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1100') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1159'))) THEN ('11_12') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1200') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1259'))) THEN ('12_13') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1300') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1359'))) THEN ('13_14') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1400') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1459'))) THEN ('14_15') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1500') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1559'))) THEN ('15_16') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1600') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1659'))) THEN ('16_17') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1700') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1759'))) THEN ('17_18') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1800') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1859'))) THEN ('18_19') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1900') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1959'))) THEN ('19_20') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '2000') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '2059'))) THEN ('20_21') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '2100') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '2159'))) THEN ('21_22') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '2200') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '2259'))) THEN ('22_23') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '2300') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '2359'))) THEN ('23_00') END,CASE WHEN (((country = 'China') and (city is not null and (city <> '')))) THEN (city) WHEN (((country = 'China') and ((city = '') or city is null))) THEN ('China') ELSE (country) END)
73. type: array<string>
74. outputColumnNames: _col0
75. UDTF Operator
76. function name: explode
77. Lateral View Join Operator
78. outputColumnNames: _col12, _col16, _col40, _col42
79. Select Operator
80. expressions:
81. expr: _col12
82. type: string
83. expr: _col16
84. type: string
85. expr: _col42
86. type: string
87. outputColumnNames: _col0, _col1, _col2
88. Reduce Output Operator
89. key expressions:
90. expr: _col2
91. type: string
92. sort order: +
93. Map-reduce partition columns:
94. expr: _col2
95. type: string
96. tag: 1
97. value expressions:
98. expr: _col0
99. type: string
100. expr: _col1
101. type: string
102. Reduce Operator Tree:
103. Join Operator
104. condition map:
105. Inner Join 0 to 1
106. condition expressions:
107. 0 {VALUE._col1} {VALUE._col2}
108. 1 {VALUE._col0} {VALUE._col1}
109. handleSkewJoin: false
110. outputColumnNames: _col1, _col2, _col5, _col6
111. Select Operator
112. expressions:
113. expr: _col5
114. type: string
115. expr: _col6
116. type: string
117. expr: _col1
118. type: int
119. expr: _col2
120. type: string
121. outputColumnNames: _col5, _col6, _col1, _col2
122. Group By Operator
123. aggregations:
124. expr: count(_col1)
125. bucketGroup: false
126. keys:
127. expr: _col5
128. type: string
129. expr: _col6
130. type: string
131. expr: _col1
132. type: int
133. expr: _col2
134. type: string
135. mode: hash
136. outputColumnNames: _col0, _col1, _col2, _col3, _col4
137. File Output Operator
138. compressed: false
139. GlobalTableId: 0
140. table:
141. input format: org.apache.hadoop.mapred.SequenceFileInputFormat
142. output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
143.
144. Stage: Stage-2
145. Map Reduce
146. Alias -> Map Operator Tree:
147. maprfs:/user/hdbatch/tmp/hive/hive_2013-10-12_15-25-43_516_746550569391015319/-mr-10002
148. Reduce Output Operator
149. key expressions:
150. expr: _col0
151. type: string
152. expr: _col1
153. type: string
154. expr: _col2
155. type: int
156. expr: _col3
157. type: string
158. sort order: ++++
159. Map-reduce partition columns:
160. expr: _col0
161. type: string
162. expr: _col1
163. type: string
164. expr: _col2
165. type: int
166. expr: _col3
167. type: string
168. tag: -1
169. value expressions:
170. expr: _col4
171. type: bigint
172. Reduce Operator Tree:
173. Group By Operator
174. aggregations:
175. expr: count(VALUE._col0)
176. bucketGroup: false
177. keys:
178. expr: KEY._col0
179. type: string
180. expr: KEY._col1
181. type: string
182. expr: KEY._col2
183. type: int
184. expr: KEY._col3
185. type: string
186. mode: mergepartial
187. outputColumnNames: _col0, _col1, _col2, _col3, _col4
188. Select Operator
189. expressions:
190. expr: _col0
191. type: string
192. expr: _col2
193. type: int
194. expr: _col4
195. type: bigint
196. expr: _col3
197. type: string
198. expr: _col1
199. type: string
200. outputColumnNames: _col0, _col1, _col2, _col3, _col4
201. Select Operator
202. expressions:
203. expr: _col0
204. type: string
205. expr: _col1
206. type: int
207. expr: UDFToInteger(_col2)
208. type: int
209. expr: _col3
210. type: string
211. expr: _col4
212. type: string
213. outputColumnNames: _col0, _col1, _col2, _col3, _col4
214. File Output Operator
215. compressed: false
216. GlobalTableId: 1
217. table:
218. input format: org.apache.hadoop.mapred.TextInputFormat
219. output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
220. serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
221. name: tanx_rtb.cookie_labels
222.
223. Stage: Stage-0
224. Move Operator
225. tables:
226. partition:
227. request_date_i 20131010
228. replace: true
229. table:
230. input format: org.apache.hadoop.mapred.TextInputFormat
231. output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
232. serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
233. name: tanx_rtb.cookie_labels
234.
235. Stage: Stage-3
236. Stats-Aggr Operator
2. 去hadoop监控页面查找任务。可以看出两者运行时间相差3倍以上
Stage2:
Stage6:
3. 分析了一下Stage6和stage2的任务:可以看出map input的数据根本就不一样,相差3倍之多,这就很奇怪了,因为我们用hive count 得的结果条数基本一致,为什么map input 输入就相差这么多呢?
任务1:
任务2:
4.接着分析去查找这两个任务的map input 输入路径:发现两者的数据路径任务一路径显然多,而任务二显然少,于是去HDFS查找发现 MR的input path 和HDFS的数据不一样(有延迟数据).
您输入的评论内容中包含违禁敏感词
我知道了
请输入正确的手机号码
请输入正确的验证码
您今天的短信下发次数太多了,明天再试试吧!
我们会在第一时间安排职业规划师联系您!
您也可以联系我们的职业规划师咨询:
版权所有 职坐标-一站式IT培训就业服务领导者 沪ICP备13042190号-4
上海海同信息科技有限公司 Copyright ©2015 www.zhizuobiao.com,All Rights Reserved.
沪公网安备 31011502005948号