workload not able to complete

Lan
I ran into some workloads with heavy I/O that fail to complete, yet report no errors. For example, this S3 workload started its main stage but never finished:


2013-11-28 00:38:25,644 [INFO] [WorkloadProcessor] - begin to process workload w691
2013-11-28 00:38:25,646 [INFO] [WorkloadProcessor] - begin to run stage s1
2013-11-28 00:38:26,379 [INFO] [StageRunner] - successfully booted all tasks in stage s1
2013-11-28 00:38:26,946 [INFO] [StageRunner] - successfully submitted all tasks in stage s1
2013-11-28 00:38:26,972 [INFO] [MissionHandler] - mission M49C247CD3E has been authed successfully
2013-11-28 00:38:26,975 [INFO] [StageRunner] - successfully authenticated all tasks in stage s1
2013-11-28 00:38:27,185 [INFO] [StageRunner] - successfully launched all tasks in stage s1
2013-11-28 00:38:27,719 [INFO] [MissionHandler] - mission M49C247CD3E has been executed successfully
2013-11-28 00:38:32,207 [INFO] [StageRunner] - successfully queied all tasks in stage s1
2013-11-28 00:38:32,227 [INFO] [MissionHandler] - mission M49C247CD3E has been closed successfully
2013-11-28 00:38:32,494 [INFO] [StageRunner] - successfully closed all tasks in stage s1
2013-11-28 00:38:34,688 [INFO] [WorkloadProcessor] - successfully ran stage s1
2013-11-28 00:38:34,689 [INFO] [WorkloadProcessor] - begin to run stage s2
2013-11-28 00:38:34,694 [INFO] [StageRunner] - successfully booted all tasks in stage s2
2013-11-28 00:38:34,721 [INFO] [StageRunner] - successfully submitted all tasks in stage s2
2013-11-28 00:38:34,724 [INFO] [MissionHandler] - mission M09C249C895 has been authed successfully
2013-11-28 00:38:34,725 [INFO] [StageRunner] - successfully authenticated all tasks in stage s2
2013-11-28 00:38:34,930 [INFO] [StageRunner] - successfully launched all tasks in stage s2
2013-11-28 00:38:34,981 [INFO] [MissionHandler] - mission M09C249C895 has been executed successfully
2013-11-28 00:38:39,947 [INFO] [StageRunner] - successfully queied all tasks in stage s2
2013-11-28 00:38:39,960 [INFO] [MissionHandler] - mission M09C249C895 has been closed successfully
2013-11-28 00:38:40,049 [INFO] [StageRunner] - successfully closed all tasks in stage s2
2013-11-28 00:38:42,431 [INFO] [WorkloadProcessor] - successfully ran stage s2
2013-11-28 00:38:42,431 [INFO] [WorkloadProcessor] - begin to run stage s3
2013-11-28 00:38:42,439 [INFO] [StageRunner] - successfully booted all tasks in stage s3
2013-11-28 00:38:42,853 [INFO] [StageRunner] - successfully submitted all tasks in stage s3
2013-11-28 00:38:42,866 [INFO] [MissionHandler] - mission M29C24BD4C6 has been authed successfully
2013-11-28 00:38:42,867 [INFO] [StageRunner] - successfully authenticated all tasks in stage s3
2013-11-28 00:38:43,255 [INFO] [StageRunner] - successfully launched all tasks in stage s3
2013-11-28 00:45:57,101 [WARN] [OperatorContext] - heavy atomic op overhead detected: 0.822519ms
2013-11-28 00:46:54,504 [WARN] [OperatorContext] - heavy atomic op overhead detected: 3.911689ms
2013-11-28 00:47:54,248 [WARN] [OperatorContext] - heavy atomic op overhead detected: 0.687465ms
2013-11-28 00:49:34,024 [WARN] [OperatorContext] - heavy atomic op overhead detected: 53.208193ms
2013-11-28 00:49:35,048 [WARN] [OperatorContext] - heavy atomic op overhead detected: 0.906275ms
2013-11-28 00:50:52,741 [WARN] [OperatorContext] - heavy atomic op overhead detected: 0.593473ms
2013-11-28 00:50:55,819 [WARN] [OperatorContext] - heavy atomic op overhead detected: 0.750748ms


The log file for mission M29C24BD4C6 (main stage) doesn't contain any warnings or errors. It simply logs each object created.

Is there a timeout for the total run time of a workload?

Thanks,
Lan

Re: workload not able to complete

Lan
Attaching the workload configuration file:

    <workstage name="init">
        <work type="init" workers="1" config="cprefix=w1;containers=r(1,1)" />
    </workstage>

    <workstage name="prepare">
        <work type="prepare" workers="1" config="cprefix=w1;containers=r(1,1);objects=r(1,1);sizes=c(1)MB" />
    </workstage>

    <workstage name="main">
        <work name="main" workers="40" runtime="900">
            <operation type="write" ratio="100" config="cprefix=w1;containers=u(1,1);objects=u(1,10000);sizes=c(1)MB" />
        </work>
    </workstage>

    <workstage name="cleanup">
        <work type="cleanup" workers="40" config="cprefix=w1;containers=r(1,1);objects=r(1,10000)" />
    </workstage>

    <workstage name="dispose">
        <work type="dispose" workers="1" config="cprefix=w1;containers=r(1,1)" />
    </workstage>

Thanks,
Lan

Re: workload not able to complete

ywang19
Administrator
Lan,

In the main stage, the total run time is set to 900 seconds, which means the stage stops after 900 seconds even if not all 10000 objects have been written. If you need all 10000 objects to be created for sure, you should create them in the prepare stage. BTW, even if you extend the runtime of the main stage, that doesn't guarantee all 10000 objects will be created: the object to write is selected randomly from #1 to #10000, so some objects may be written multiple times and others may never be selected.
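
To get a feel for how many distinct objects a random write stage actually covers, here is a minimal Python sketch. It assumes each write picks an object uniformly and independently from the range, which is my reading of "randomly selected" and not something confirmed from the COSBench source. The function names and the write counts are made up for illustration.

```python
import random


def expected_distinct(num_objects: int, num_writes: int) -> float:
    """Expected number of distinct objects hit by num_writes uniform picks.

    Assumption: every pick is uniform and independent over num_objects.
    """
    miss_probability = (1.0 - 1.0 / num_objects) ** num_writes
    return num_objects * (1.0 - miss_probability)


def simulate_distinct(num_objects: int, num_writes: int, seed: int = 0) -> int:
    """Count distinct objects hit in one simulated run of uniform picks."""
    rng = random.Random(seed)
    return len({rng.randint(1, num_objects) for _ in range(num_writes)})


if __name__ == "__main__":
    total = 10000  # object range from the workload config
    for writes in (5000, 10000, 30000):  # hypothetical write counts
        print(writes,
              round(expected_distinct(total, writes)),
              simulate_distinct(total, writes))
```

Under that assumption, the number of writes has to be several times the size of the object range before nearly every object has been written at least once, which is why the prepare stage is the safer place to create a fixed set of objects.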

-yaguang


Re: workload not able to complete

birdsview
Hi Yaguang,

Thank you for your reply. I understand that only a subset of the objects in the range will be created, since objects are selected randomly within the specified time span. The problem I encountered is that the program quit without finishing the main stage. The log file only shows that the main stage started properly; it doesn't print any further progress for the workload, and then the program terminated.

This seems to happen only with a relatively large object space, in this example 10000 objects. When I rerun the same workload, sometimes it completes and sometimes it doesn't. If I make the object space even larger, the workload most likely never finishes normally. Are there other logs to look at in addition to system.log and the logs under the missions subfolder?
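
For reference, one way to check which stage stalled is to scan system.log for stages that have a "begin to run stage" line but no matching "successfully ran stage" line. This is only a rough Python sketch that assumes the line format shown in the original post; the function name `unfinished_stages` is made up for illustration and is not part of COSBench.

```python
import re

# Assumed line format, taken from the log excerpt in the original post:
# 2013-11-28 00:38:25,646 [INFO] [WorkloadProcessor] - begin to run stage s1
LINE_RE = re.compile(r"^(\S+ \S+) \[(\w+)\] \[(\w+)\] - (.*)$")
BEGIN_RE = re.compile(r"begin to run stage (\S+)")
DONE_RE = re.compile(r"successfully ran stage (\S+)")


def unfinished_stages(lines):
    """Return {stage: start timestamp} for stages that began but never finished."""
    started = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed format
        timestamp, _level, _source, message = match.groups()
        begin = BEGIN_RE.search(message)
        if begin:
            started[begin.group(1)] = timestamp
            continue
        done = DONE_RE.search(message)
        if done:
            started.pop(done.group(1), None)
    return started


if __name__ == "__main__":
    # Path is relative to wherever the controller writes its logs.
    with open("system.log", encoding="utf-8") as log_file:
        for stage, since in unfinished_stages(log_file).items():
            print(f"stage {stage} started at {since} and never reported completion")
```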

Regards,
Lan

Re: workload not able to complete

ywang19
Administrator
A log file called system.log under the "log" folder should contain the relevant messages. Could you post the part that corresponds to your run?

Re: workload not able to complete

Lan
The content of system.log is posted in the original post. Thanks,