Saturday, January 21, 2017

Using SAP HANA Cloud Platform from ABAP to find root causes for runtime bottlenecks

Hi guys,

continuing my previous experiments, I have now set aside AWS Machine Learning for a while and turned towards the SAP HANA Cloud Platform. SAP HCP also offers Machine Learning capabilities under the label of "Predictive Analysis". While reading up on it, I found that it goes beyond prediction: it offers introspection mechanisms on the models you create within SAP HCP Predictive Analysis, such as finding the key influencing factors for target figures in your data. So far this is not offered by Amazon's Machine Learning, so it got me excited.

Goals

The goal of this experiment is to enable every ABAP developer to use SAP HCP Predictive Analysis via a native API layer, without the hassle of having to understand the architecture, infrastructure and details of the implementation. This brings the capabilities much closer to where the hidden data treasure lies, waiting for insights: the ERP system - even if it's not S/4 HANA or not even sitting on top of a HANA database itself.

This ultimately enables ABAP developers to extend existing business functionality with Predictive Analysis capabilities.




Use Case

I decided to keep the topic the same as in my previous articles: predicting the runtime of the SNP System Scan. This time, however, I wanted to take advantage of the introspection capabilities and ask which factors influence the runtime the most.

I am going in with the expectation that the runtime will most likely depend on the database, its vendor and its version. Of course the version of the SNP System Scan should also play a significant role, as we are constantly trying to improve performance. However, this may be in a battle with the features we are adding. The industry could also be of interest, because it may lead to different areas of the database being populated with different amounts of data, due to different processes being used. Let's see what HCP finds out.

Preparation

In order to make use of SAP HCP Predictive Services you have to fulfill some prerequisites. Rather than describing them completely, I will reference the tutorials I followed to do the same within my company. If you have already set up an SAP HCP account and the SAP HANA Cloud Connector, you only need to perform steps 2, 4 and 5.
  1. Create an SAP HCP trial account.
  2. Deploy SAP HANA Cloud Platform predictive services on your trial account (see this tutorial).
  3. Install and set up the SAP HANA Cloud Connector in your corporate system landscape.
  4. Configure access in the SAP HANA Cloud Connector to the HANA database you have set up in step 2.
  5. This makes your HCP database appear to be in your own network, so you can configure it as an ADBC resource in transaction DBCO of the SAP NetWeaver system on which you will execute your ABAP code later.

Architecture

After having set up the infrastructure, let's think about the architecture of the application. It's not one of the typical extension patterns that SAP foresees: the application logic resides on the ABAP system and makes use of an application in the cloud, rather than a new application sitting in SAP HCP and treating your on-premise SAP systems as data sources.



Example Implementation

So by this time all necessary prerequisites are fulfilled and it's time to have some fun with ABAP - hard to believe ;-) But as long as you can build a REST client, you can extend your core with basically anything you can imagine nowadays. So here we go:

FORM main.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lr_scan_data TYPE REF TO data.
  DATA: lr_prepared_data TYPE REF TO data.
  DATA: lr_ml TYPE REF TO /snp/hcp01_cl_ml.
  DATA: lv_dataset_id TYPE i.
  DATA: lr_ex TYPE REF TO cx_root.
  DATA: lv_msg TYPE string.

  FIELD-SYMBOLS: <lt_data> TYPE table.

*"--- PROCESSING LOGIC ------------------------------------------------
  TRY.
      "fetch the data into an internal table
      PERFORM get_system_scan_data CHANGING lr_scan_data.
      ASSIGN lr_scan_data->* TO <lt_data>.

      "prepare data (e.g. convert, select features)
      PERFORM prepare_data USING <lt_data> CHANGING lr_prepared_data.
      ASSIGN lr_prepared_data->* TO <lt_data>.

      "create a dataset (called a model on other platforms like AWS)
      CREATE OBJECT lr_ml.
      PERFORM create_dataset USING lr_ml <lt_data> CHANGING lv_dataset_id.

      "check if...
      IF lr_ml->is_ready( lv_dataset_id ) = abap_true.

        "...creation was successful
        PERFORM find_key_influencers USING lr_ml lv_dataset_id.

      ELSEIF lr_ml->is_failed( lv_dataset_id ) = abap_true.

        "...creation failed
        lv_msg = /snp/cn00_cl_string_utils=>text( iv_text = 'Model &1 has failed' iv_1 = lv_dataset_id ).
        MESSAGE lv_msg TYPE 'S' DISPLAY LIKE 'E'.

      ENDIF.

    CATCH cx_root INTO lr_ex.

      "output errors
      lv_msg = lr_ex->get_text( ).
      PERFORM display_lines USING lv_msg.

  ENDTRY.

ENDFORM.
This is basically the same procedure as last time, when connecting to AWS Machine Learning. Again I fetched the data via a REST service from the SNP Data Cockpit instance I am using to keep statistics on all executed SNP System Scans. However, you can fetch the data that will serve as the data source for your model in any way you like. Most probably you will be using OpenSQL SELECTs. Just as a reminder, the results looked somewhat like this:


Prepare the Data

This is the raw data and it's not perfect! In the shape that it's in, the data quality is not good enough. According to this article, there are some improvements that I need to make in order to improve its quality.

  1. Normalizing values (e.g. lower casing, mapping values or clustering values). E.g.
    • Combining the database vendor and the major version of the database because those two values only make sense when treated in combination and not individually
    • Clustering the database size to 1.5TB chunks as these values can be guessed easier when executing predictions
    • Clustering the runtime into exponentially increasing categories, as I did for AWS, does not work with HCP Predictive Services: so far it only solves regression problems, which rely on numeric target values.
  2. Filling up empty values with reasonable defaults. E.g.
    • treating all unknown SAP client types as test clients
  3. Make values and field names more human readable. This is not necessary for the machine learning algorithms, but it makes for better manual result interpretation
  4. Removing fields that do not make good features, like 
    • IDs
    • fields that cannot be provided for later predictions, because values cannot be determined easily or intuitively
  5. Remove records that still do not have good data quality. E.g. missing values in
    • database vendors
    • SAP system types
    • customer industry
  6. Remove records that are not representative. E.g. 
    • they refer to scans with exceptionally short runtimes probably due to intentionally limiting the scope
    • small database sizes that are probably due to non productive systems
So the resulting code for this preparation and data cleansing looks almost the same as in the AWS example:

FORM prepare_data USING it_data TYPE table CHANGING rr_data TYPE REF TO data.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lr_q TYPE REF TO /snp/cn01_cl_itab_query.

*"--- PROCESSING LOGIC ------------------------------------------------
  CREATE OBJECT lr_q.

  "selecting the fields that make good features
  lr_q->select( iv_field = 'COMP_VERSION'       iv_alias = 'SAP_SYSTEM_TYPE' ).
  lr_q->select( iv_field = 'DATABASE'           iv_uses_fields = 'NAME,VERSION' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'DATABASE_SIZE'      iv_uses_fields = 'DB_USED' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'OS'                 iv_alias = 'OPERATING_SYSTEM' ).
  lr_q->select( iv_field = 'SAP_CLIENT_TYPE'    iv_uses_fields = 'CCCATEGORY' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD'  ).
  lr_q->select( iv_field = 'COMPANY_INDUSTRY1'  iv_alias = 'INDUSTRY' ).
  lr_q->select( iv_field = 'IS_UNICODE'         iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'SCAN_VERSION' ).
  lr_q->select( iv_field = 'RUNTIME_MINUTES'    iv_ddic_type = 'INT4' ). "make sure this column is converted into a number

  "perform the query on the defined internal table
  lr_q->from( it_data ).

  "filter records that are not good for results
  lr_q->filter( iv_field = 'DATABASE'           iv_filter = '-' ). "no empty values in the database
  lr_q->filter( iv_field = 'SAP_SYSTEM_TYPE'    iv_filter = '-' ). "no empty values in the SAP System Type
  lr_q->filter( iv_field = 'INDUSTRY'           iv_filter = '-' ). "no empty values in the Industry
  lr_q->filter( iv_field = 'RUNTIME_MINUTES'    iv_filter = '>=10' ). "Minimum of 10 minutes runtime
  lr_q->filter( iv_field = 'DATABASE_GB_SIZE'   iv_filter = '>=50' ). "Minimum of 50 GB database size

  "sort by runtime
  lr_q->sort( 'RUNTIME_MINUTES' ).

  "execute the query
  rr_data = lr_q->run( ).

ENDFORM.
Basically the magic is done by the /SNP/CN01_CL_ITAB_QUERY class, which is part of the SNP Transformation Backbone framework. It enables SQL-like query capabilities on ABAP internal tables. This includes transforming field values, which is done using callback mechanisms.

FORM on_virtual_field USING iv_field is_record TYPE any CHANGING cv_value TYPE any.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lv_database TYPE string.
  DATA: lv_database_version TYPE string.
  DATA: lv_tmp TYPE string.
  DATA: lv_int TYPE i.
  DATA: lv_p(16) TYPE p DECIMALS 1.

  FIELD-SYMBOLS: <lv_value> TYPE any.

*"--- MACRO DEFINITION ------------------------------------------------
  DEFINE mac_get_field.
    clear: &2.
    assign component &1 of structure is_record to <lv_value>.
    if sy-subrc = 0.
      &2 = <lv_value>.
    else.
      return.
    endif.
  END-OF-DEFINITION.

*"--- PROCESSING LOGIC ------------------------------------------------
  CASE iv_field.
    WHEN 'DATABASE'.

      "combine database name and major version to one value
      mac_get_field 'NAME' lv_database.
      mac_get_field 'VERSION' lv_database_version.
      SPLIT lv_database_version AT '.' INTO lv_database_version lv_tmp.
      CONCATENATE lv_database lv_database_version INTO cv_value SEPARATED BY space.

    WHEN 'DATABASE_SIZE'.

      "categorize the database size into 1.5 TB chunks (e.g. "up to 4.5 TB")
      mac_get_field 'DB_USED' cv_value.
      lv_p = ( floor( cv_value / 1500 ) + 1 ) * '1.5'. "simple round to full 1.5TB chunks
      cv_value = /snp/cn00_cl_string_utils=>text( iv_text = 'up to &1 TB' iv_1 = lv_p ).
      TRANSLATE cv_value USING ',.'. "translate commas to dots so the CSV does not get confused

    WHEN 'SAP_CLIENT_TYPE'.

      "fill up the client category type with a default value
      mac_get_field 'CCCATEGORY' cv_value.
      IF cv_value IS INITIAL.
        cv_value = 'T'. "default to (T)est SAP client
      ENDIF.

    WHEN 'IS_UNICODE'.

      "convert the unicode flag into more human readable values
      IF cv_value = abap_true.
        cv_value = 'unicode'.
      ELSE.
        cv_value = 'non-unicode'.
      ENDIF.

  ENDCASE.

ENDFORM.
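
The DATABASE_SIZE branch above packs a small piece of arithmetic into one line. For readers who want to try it outside an ABAP system, here is a minimal Python sketch of the same rounding. It assumes that DB_USED is given in GB, which the divisor of 1500 per 1.5 TB chunk suggests but the listing does not state; the function and constant names are made up for this illustration.

```python
import math

CHUNK_GB = 1500   # bucket width in GB (assumption: DB_USED is measured in GB)
CHUNK_TB = 1.5    # the same bucket width expressed in TB, used for the label


def database_size_bucket(db_used_gb: float) -> str:
    """Return the 'up to N TB' label for a database size given in GB."""
    # floor(size / width) counts the full chunks below the value,
    # + 1 moves to the upper edge of the chunk the value falls into
    upper_tb = (math.floor(db_used_gb / CHUNK_GB) + 1) * CHUNK_TB
    return f"up to {upper_tb:.1f} TB"


# floor(3200 / 1500) = 2, so the label is (2 + 1) * 1.5 = 4.5 TB
print(database_size_bucket(3200))
# an exact multiple such as 1500 GB lands in the next bucket (3.0 TB)
print(database_size_bucket(1500))
```

Note the edge case in the second call: because of the `+ 1`, a size that is an exact multiple of the chunk width is labelled with the next higher bucket.
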
After that, the data looks nice and cleaned up, like this:


Creating the Dataset

In SAP HANA Cloud Platform Predictive Services you create a dataset rather than a model. This is basically split up into:

  1. Creating a Database Table in a HANA Database 
  2. Uploading Data into that Database Table
  3. Registering the dataset 

This mainly corresponds to creating a datasource with the AWS Machine Learning API. However, you do not explicitly train the dataset or create a model. This is done implicitly - maybe. We'll discover more about that soon.

FORM create_dataset USING ir_hcp_machine_learning TYPE REF TO /snp/hcp01_cl_ml
                          it_table TYPE table
                 CHANGING rv_dataset_id.

  rv_dataset_id = ir_hcp_machine_learning->create_dataset(

    "...by creating a temporary table in a HCP HANA database
    "   instance from an internal table
    "   and inserting the records, so it can be used
    "   as a machine learning data set
    it_table = it_table

  ).

ENDFORM.

My API will create a temporary table for each internal table you are creating a dataset on. It's a column table without a primary key. All column types are determined automatically using runtime type inspection. If columns of the internal table are strings, I determine the length by scanning the content rather than creating CLOBs, which are not well suited for Predictive Services.
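
To make the idea of content-based column sizing concrete, here is a minimal Python sketch. It is not the implementation of my ABAP API; the helper names, the table name and the mapping to INTEGER, DOUBLE and NVARCHAR are assumptions chosen for illustration. It derives one column definition per field and sizes string columns by the longest value found.

```python
def hana_column_type(values):
    """Pick a column type for one column from the values it actually contains."""
    present = [v for v in values if v is not None]
    is_number = lambda v, kinds: isinstance(v, kinds) and not isinstance(v, bool)
    if present and all(is_number(v, int) for v in present):
        return "INTEGER"
    if present and all(is_number(v, (int, float)) for v in present):
        return "DOUBLE"
    # strings: scan the content and use the longest value as the column width
    width = max((len(str(v)) for v in present), default=1)
    return f"NVARCHAR({max(width, 1)})"


def create_table_statement(table_name, rows):
    """Build a CREATE COLUMN TABLE statement from a non-empty list of dict rows."""
    definitions = ", ".join(
        f'"{name}" {hana_column_type([row[name] for row in rows])}'
        for name in rows[0]
    )
    return f'CREATE COLUMN TABLE "{table_name}" ({definitions})'


rows = [
    {"DATABASE": "ORACLE 12", "RUNTIME_MINUTES": 95},
    {"DATABASE": "SAP HANA DB 1", "RUNTIME_MINUTES": 240},
]
print(create_table_statement("ZSCAN_TMP", rows))
```

The price of narrow columns is one extra pass over the data before the table can be created.
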

Please note that upload speed suffers significantly if you are inserting content line-by-line, which is the case if cl_sql_statement does not support set_param_table on your release. This was also the case for my system, so I had to build that functionality myself.
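
The pattern behind a mass insert is independent of ABAP: hand the database a whole batch of parameter rows per call instead of one row per call. The following Python sketch shows that pattern with the standard library's sqlite3 module as a stand-in for the remote HANA database. The table name, column names and batch size are assumptions, and an in-memory database will of course not show the network round trips that make the difference against HCP.

```python
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(connection, rows, batch_size=1000):
    """Insert rows with one executemany call per batch; return the number of calls."""
    calls = 0
    for batch in chunked(rows, batch_size):
        connection.executemany(
            "INSERT INTO scan_data (db, runtime_minutes) VALUES (?, ?)", batch
        )
        calls += 1
    connection.commit()
    return calls


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE scan_data (db TEXT, runtime_minutes INTEGER)")
rows = [("ORACLE 12", 10 + i) for i in range(2500)]

calls = insert_in_batches(connection, rows)
count = connection.execute("SELECT COUNT(*) FROM scan_data").fetchone()[0]
print(calls, count)  # 2500 rows go over in 3 calls instead of 2500
```
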

After that it is finally time to find the key influencers that affect the runtime of the SNP System Scan the most...

FORM find_key_influencers USING ir_ml TYPE REF TO /snp/hcp01_cl_ml
                                iv_dataset_id TYPE i.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lt_key_influencers TYPE /snp/hcp00_tab_key_influencer.

*"--- PROCESSING LOGIC ------------------------------------------------
  "...introspect the model, e.g. finding the features (=columns) that influence
  "   a given figure (=target column) in descending order
  lt_key_influencers = ir_ml->get_key_influencers(

    "which dataset should be inspected
    iv_dataset_id = iv_dataset_id

    "what is the target column, for which the key influencers
    "should be calculated
    iv_target_column = 'RUNTIME_MINUTES'

    "how many key influencers should be calculated?
    iv_number_of_influencers = 5

  ).

  "DATABASE_SIZE:    37% Influence
  "DATABASE:         23% Influence (e.g. ORACLE 12, SAP HANA DB etc.)
  "SCAN_VERSION:     15% Influence
  "OPERATING_SYSTEM: 10% Influence
  "SAP_SYSTEM_TYPE:   5% Influence (e.g. SAP R/3 4.6c; SAP ECC 6.0; S/4 HANA 16.10 etc.)

  "...remove dataset afterwards
  ir_ml->remove_dataset( iv_dataset_id ).

ENDFORM.


As mentioned above, database size, database vendor and scan version were not a surprise. I didn't think that the operating system would have such a big influence, as SAP NetWeaver abstracts it away. I expected the SAP system type to have more of an influence, as I figured that different data models would have a bigger impact on performance. So all in all not so many surprises, but then again, that makes the model trustworthy...

Challenges

Along the way I ran into some challenges.


  • Authentication: I always seem to have a problem finding the simplest things, like which flags to set in the authentication mechanism. Just make sure to switch on "Trusted SAML 2.0 identity provider", "User name and password", "Client Certificate" and "Application-to-Application SSO" on the FORM card of the Authentication Configuration, and do not waste hours like me.
  • Upload Speed: As stated above, if you insert the contents of your internal table line-by-line, performance suffers significantly. On the other hand, inserting several 100k records was not much of a problem once you tap into mass insert/update. It may not be available in your ADBC implementation, depending on the version of your SAP NetWeaver stack, so consider backporting it from a newer system. It's definitely worth it.
  • Table creation: I am a big fan of dynamic programming despite the performance penalties it sometimes has. However, when you create database tables to persist your dataset in your HCP HANA database, you have to make sure that columns are as narrow as possible, for good performance or even for the relevance of your results.

Features of SAP HCP Predictive Analysis


  • Key Influencers: This is the use case that I have shown in this article
  • Scoring Equation: You can get the code that calculates the predictions either as an SQL query executable on any HANA database or as a score card. The former is basically a decision tree, which can easily be transpiled into other languages and thereby be used for on-premise deliveries. On the other hand, this shows that the mechanics underneath the SAP HCP Predictive Analysis application are currently quite simple, which I will dig into more in the conclusion below.
  • Forecast: Based on a time-based column you can predict future values
  • Outliers: You can find exceptions in your dataset. While key influencers are more focused on the structure, as they represent the columns that are influential to a result, outliers show the rows that are exceptions to the calculated equation.
  • WhatIf: Simulates a planned action and returns the significant changes resulting from it
  • Recommendations: Finding interesting products based on a purchase history by product and/or user. This can also be transferred to other recommendation scenarios.


Conclusion

So after this rather lengthy article I have come to a conclusion about SAP HCP Predictive Services, especially compared to AWS Machine Learning capabilities:

Pros
  • Business oriented scenarios: You definitely do not have to think as much about good use cases. The API presents them to you, as shown in the "Features" section above.
  • Fast: Calculation is really fast. Predictions are available almost instantaneously, especially when benchmarked against AWS, where building a datasource and training a model took a good 10 minutes. But do not overestimate this, as the Cons will show you.
  • Introspection: Many services are about looking into the scenario. AWS is just about prediction at the moment. This transparency about dependencies inside the dataset was the most interesting part for me.
  • On-premise delivery of models via decision trees: The fact that models are exposed as executable decision trees that can easily be transpiled into any other programming language makes on-premise delivery possible. Basically prediction is effortless after doing so. But then again you have to manage the model life cycle and how updates to it are rolled out.
Cons
  • Higher Cost of Infrastructure: At least on the fixed cost side, a productive HCP account is not cheap. But then again there is no variable cost if you are able to deploy your models on premise.
  • Only Regression: Currently target figures have to be numeric, so only regression problems can be solved, no classification problems. Of course HANA also has natural language processing on board, but this is not available for machine learning purposes per se.
  • Few options for manipulating learning: You just register a model. Nothing is said about how training and validation are to be performed, how to normalize data on the platform and so on.
  • Trading speed for quality: As stated, registering a model is fast, and introspecting it is also very fast. But then again I was able to achieve different results with the same dataset when I sorted it differently. Consistently. And not just offsetting the model by 2% but rather big time. For example, when sorting my dataset differently the key influencers turned out to be completely different ones. This is actually quite concerning. Maybe I am missing something, but maybe this is why training AWS models takes significantly longer: they scramble datasets and run multiple passes over them to determine the best and most stable model.
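
The sort-order sensitivity from the last point can be checked systematically rather than by accident. Below is a minimal Python sketch of such a check: it shuffles the dataset a few times, asks for the ranking again and reports the worst overlap with the baseline ranking. The `rank_influencers` callable is hypothetical and stands for whatever returns the ordered list of influencer columns (for me, the HCP call); the toy ranker is only there to make the sketch runnable.

```python
import random


def top_k_overlap(first, second, k):
    """Share of the top-k entries that two rankings have in common (0.0 to 1.0)."""
    return len(set(first[:k]) & set(second[:k])) / k


def ranking_stability(rows, rank_influencers, k=3, runs=5, seed=42):
    """Re-rank shuffled copies of rows; return the lowest top-k overlap with the baseline."""
    rng = random.Random(seed)
    baseline = rank_influencers(rows)
    worst = 1.0
    for _ in range(runs):
        shuffled = rows[:]          # copy, so the caller's order is untouched
        rng.shuffle(shuffled)
        worst = min(worst, top_k_overlap(baseline, rank_influencers(shuffled), k))
    return worst


def toy_ranker(rows):
    """Stand-in ranker (assumption): orders columns by their number of distinct values."""
    return sorted(rows[0], key=lambda c: (-len({row[c] for row in rows}), c))


data = [
    {"DATABASE": "ORACLE 12", "OS": "Linux", "UNICODE": "unicode"},
    {"DATABASE": "SAP HANA DB 1", "OS": "Linux", "UNICODE": "unicode"},
    {"DATABASE": "MAXDB 7", "OS": "AIX", "UNICODE": "unicode"},
]
# the toy ranker only looks at sets of values, so row order cannot change its result
print(ranking_stability(data, toy_ranker, k=2))
```

A result well below 1.0 for a real service would confirm that the ranking depends on the order of the rows.
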
While SAP HCP Predictive Services looks very promising, has good use cases and is appealing especially for its transparency, its stability and reliability have to improve before it's safe to rely on it for business decisions. Well, I only have an HCP trial account at the moment, so maybe this is intentional. Let's see how predictive services on-premise on a HANA 2.0 database are doing...

Tuesday, January 17, 2017

Synthesizing your voice: WaveNet by Google DeepMind

I remember the day (sometime in the 90s) when computer generated voices sounded - well, synthetic. Today you can still tell the difference between a human and a machine speaking to you, although they have gotten very good. On the other hand, it's probably good that you can tell the difference. Think about all the implicit expectations that we would have if we thought a human was speaking to us...


But then again, think about how much existing content in text form we could leverage, given a natural sounding voice reading it to us. Many e-learning platforms are already using TTS, but to be honest most of them do not cut it when they do. It's different from watching a YouTube tutorial with an energetic tutor that grabs my attention.


But technology keeps catching up: WaveNet by Google DeepMind is promising, generating voices from actual audio samples. Imagine hearing your voice reading a book or a tutorial without you ever having read it (yes, I know it's awkward to hear your own voice when you are not used to it).


Based on deep learning techniques, WaveNet picks up subtleties such as breathing rhythm and individual intonation. Energizing the generated TTS with some markup is probably not so far away...


Thursday, January 05, 2017

Creating AWS Machine Learning Models from ABAP

Hi guys,

extending my previous article about "Using AWS Machine Learning from ABAP to predict runtimes" I have now been able to extend the ABAP based API to create models from ABAP internal tables (which is like a collection of records, for the Non-ABAPers ;-).

This basically enables ABAP developers to utilize Machine Learning full cycle without ever having to leave their home turf or worry about the specifics of the AWS Machine Learning implementations.

My use case still is the same: Predicting runtimes of the SNP System Scan based on well known parameters like database vendor (e.g. Oracle, MaxDB), database size, SNP System Scan version and others. But since my first model was not quite meeting my expectations I wanted to be able to play around easily, adding and removing attributes from the model with a nice ABAP centric workflow. This probably makes it most effective for other ABAP developers to utilize Machine Learning. So let's take a look at the basic structure of the example program:

REPORT /snp/aws01_ml_create_model.

START-OF-SELECTION.
  PERFORM main.

FORM main.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lr_scan_data TYPE REF TO data.
  DATA: lr_prepared_data TYPE REF TO data.
  DATA: lr_ml TYPE REF TO /snp/aws00_cl_ml.
  DATA: lv_model_id TYPE string.
  DATA: lr_ex TYPE REF TO cx_root.
  DATA: lv_msg TYPE string.

  FIELD-SYMBOLS: <lt_data> TYPE table.

*"--- PROCESSING LOGIC ------------------------------------------------
  TRY.
      "fetch the data into an internal table
      PERFORM get_system_scan_data CHANGING lr_scan_data.
      ASSIGN lr_scan_data->* TO <lt_data>.

      "prepare data (e.g. convert, select features)
      PERFORM prepare_data USING <lt_data> CHANGING lr_prepared_data.
      ASSIGN lr_prepared_data->* TO <lt_data>.

      "create a model
      CREATE OBJECT lr_ml.
      PERFORM create_model USING lr_ml <lt_data> CHANGING lv_model_id.

      "check if...
      IF lr_ml->is_ready( lv_model_id ) = abap_true.

        "...creation was successful
        lv_msg = /snp/cn00_cl_string_utils=>text( iv_text = 'Model &1 is ready' iv_1 = lv_model_id ).
        MESSAGE lv_msg TYPE 'S'.

      ELSEIF lr_ml->is_failed( lv_model_id ) = abap_true.

        "...creation failed
        lv_msg = /snp/cn00_cl_string_utils=>text( iv_text = 'Model &1 has failed' iv_1 = lv_model_id ).
        MESSAGE lv_msg TYPE 'S' DISPLAY LIKE 'E'.

      ENDIF.

    CATCH cx_root INTO lr_ex.

      "output errors
      lv_msg = lr_ex->get_text( ).
      PERFORM display_lines USING lv_msg.

  ENDTRY.

ENDFORM.

And now let's break it down into its individual parts:

Fetch Data into an Internal Table

In my particular case I was fetching the data via a REST Service from the SNP Data Cockpit instance I am using to keep statistics on all executed SNP System Scans. However, you can basically fetch your data that will be used as a data source for your model in any way that you like. Most probably you will be using OpenSQL SELECTs to fetch the data accordingly. Resulting data looks somewhat like this:

Prepare Data

This is the raw data and it's not perfect! In the shape that it's in, the data quality is not good enough. According to this article, there are some improvements that I need to make in order to improve its quality.
  • Normalizing values (e.g. lower casing, mapping values or clustering values). E.g.
    • Combining the database vendor and the major version of the database because those two values only make sense when treated in combination and not individually
    • Clustering the database size to 1.5TB chunks as these values can be guessed easier when executing predictions
    • Clustering the runtime into exponentially increasing categories (although this may also hurt accuracy...)
  • Filling up empty values with reasonable defaults. E.g.
    • treating all unknown SAP client types as test clients
  • Make values and field names more human readable. This is not necessary for the machine learning algorithms, but it makes for better manual result interpretation
  • Removing fields that do not make good features, like 
    • IDs
    • fields that cannot be provided for later predictions, because values cannot be determined easily or intuitively
  • Remove records that still do not have good data quality. E.g. missing values in
    • database vendors
    • SAP system types
    • customer industry
  • Remove records that are not representative. E.g. 
    • they refer to scans with exceptionally short runtimes probably due to intentionally limiting the scope
    • small database sizes that are probably due to non productive systems
FORM prepare_data USING it_data TYPE table CHANGING rr_data TYPE REF TO data.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lr_q TYPE REF TO /snp/cn01_cl_itab_query.

*"--- PROCESSING LOGIC ------------------------------------------------
  CREATE OBJECT lr_q.

  "selecting the fields that make good features
  lr_q->select( iv_field = 'COMP_VERSION'      iv_alias = 'SAP_SYSTEM_TYPE' ).
  lr_q->select( iv_field = 'DATABASE'          iv_uses_fields = 'NAME,VERSION' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'DATABASE_SIZE'     iv_uses_fields = 'DB_USED' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'OS'                iv_alias = 'OPERATING_SYSTEM' ).
  lr_q->select( iv_field = 'SAP_CLIENT_TYPE'   iv_uses_fields = 'CCCATEGORY' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'COMPANY_INDUSTRY1' iv_alias = 'INDUSTRY' ).
  lr_q->select( iv_field = 'IS_UNICODE'        iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).
  lr_q->select( iv_field = 'SCAN_VERSION' ).
  lr_q->select( iv_field = 'RUNTIME'           iv_uses_fields = 'RUNTIME_HOURS' iv_cb_program = sy-repid iv_cb_form = 'ON_VIRTUAL_FIELD' ).

  "perform the query on the defined internal table
  lr_q->from( it_data ).

  "filter records that are not good for results
  lr_q->filter( iv_field = 'DATABASE'         iv_filter = '-' ). "no empty values in the database
  lr_q->filter( iv_field = 'SAP_SYSTEM_TYPE'  iv_filter = '-' ). "no empty values in the SAP System Type
  lr_q->filter( iv_field = 'INDUSTRY'         iv_filter = '-' ). "no empty values in the Industry
  lr_q->filter( iv_field = 'RUNTIME_MINUTES'  iv_filter = '>=10' ). "Minimum of 10 minutes runtime
  lr_q->filter( iv_field = 'DATABASE_GB_SIZE' iv_filter = '>=50' ). "Minimum of 50 GB database size

  "sort by runtime
  lr_q->sort( 'RUNTIME_MINUTES' ).

  "execute the query
  rr_data = lr_q->run( ).

ENDFORM.

Basically the magic is done by the /SNP/CN01_CL_ITAB_QUERY class, which is part of the SNP Transformation Backbone framework. It enables SQL-like query capabilities on ABAP internal tables. This includes transforming field values, which is done using callback mechanisms.


FORM on_virtual_field USING iv_field is_record TYPE any CHANGING cv_value TYPE any.

  "...

  CASE iv_field.
    WHEN 'DATABASE'.

      "combine database name and major version to one value
      mac_get_field 'NAME' lv_database.
      mac_get_field 'VERSION' lv_database_version.
      SPLIT lv_database_version AT '.' INTO lv_database_version lv_tmp.
      CONCATENATE lv_database lv_database_version INTO cv_value SEPARATED BY space.

    WHEN 'DATABASE_SIZE'.

      "categorize the database size into 1.5 TB chunks (e.g. "up to 4.5 TB")
      mac_get_field 'DB_USED' cv_value.
      lv_p = ( floor( cv_value / 1500 ) + 1 ) * '1.5'. "simple round to full 1.5TB chunks
      cv_value = /snp/cn00_cl_string_utils=>text( iv_text = 'up to &1 TB' iv_1 = lv_p ).
      TRANSLATE cv_value USING ',.'. "translate commas to dots so the CSV does not get confused

    WHEN 'SAP_CLIENT_TYPE'.

      "fill up the client category type with a default value
      mac_get_field 'CCCATEGORY' cv_value.
      IF cv_value IS INITIAL.
        cv_value = 'T'. "default to (T)est SAP client
      ENDIF.

    WHEN 'IS_UNICODE'.

      "convert the unicode flag into more human readable values
      IF cv_value = abap_true.
        cv_value = 'unicode'.
      ELSE.
        cv_value = 'non-unicode'.
      ENDIF.

    WHEN 'RUNTIME'.

      "categorize the runtime into human readable chunks
      mac_get_field 'RUNTIME_HOURS' lv_int.
      IF lv_int <= 1.
        cv_value = 'up to 1 hour'.
      ELSEIF lv_int <= 2.
        cv_value = 'up to 2 hours'.
      ELSEIF lv_int <= 3.
        cv_value = 'up to 3 hours'.
      ELSEIF lv_int <= 4.
        cv_value = 'up to 4 hours'.
      ELSEIF lv_int <= 5.
        cv_value = 'up to 5 hours'.
      ELSEIF lv_int <= 6.
        cv_value = 'up to 6 hours'.
      ELSEIF lv_int <= 12.
        cv_value = 'up to 12 hours'.
      ELSEIF lv_int <= 24.
        cv_value = 'up to 1 day'.
      ELSEIF lv_int <= 48.
        cv_value = 'up to 2 days'.
      ELSEIF lv_int <= 72.
        cv_value = 'up to 3 days'.
      ELSE.
        cv_value = 'more than 3 days'.
      ENDIF.

  ENDCASE.

ENDFORM.
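
The RUNTIME branch above is a lookup of the first upper bound that is not exceeded. As a compact cross-check of the category boundaries, here is a minimal Python sketch of the same mapping using a sorted list of bounds and a binary search. The names are made up for this illustration; the bounds and labels are copied from the ELSEIF chain.

```python
from bisect import bisect_left

# upper bounds in hours, in the same order as the ELSEIF chain
UPPER_BOUNDS_HOURS = [1, 2, 3, 4, 5, 6, 12, 24, 48, 72]
LABELS = [
    "up to 1 hour", "up to 2 hours", "up to 3 hours", "up to 4 hours",
    "up to 5 hours", "up to 6 hours", "up to 12 hours",
    "up to 1 day", "up to 2 days", "up to 3 days",
    "more than 3 days",  # fallback for everything above the last bound
]


def runtime_category(runtime_hours: int) -> str:
    """Return the label of the first bound that is >= runtime_hours."""
    # bisect_left finds the index of the first bound >= runtime_hours,
    # which mirrors the chain of "<=" comparisons
    return LABELS[bisect_left(UPPER_BOUNDS_HOURS, runtime_hours)]


for hours in (1, 7, 24, 25, 100):
    print(hours, "->", runtime_category(hours))
```

Keeping bounds and labels in two lists makes it easy to experiment with other category widths when playing around with the model.
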

After running all those preparations, the data is transformed into a record set that looks like this:


Create a Model

OK, preparing data for a model is something that the developer has to do for each individual problem to be solved. But I guess this is done better if performed in a well known environment. After all, this is the whole purpose of the ABAP API. Now we get to the part that's easy, as creating the model based on the internal table we have prepared so far is fully automated. As a developer you are completely relieved from the following tasks:

  • Converting the internal table into CSV
  • Uploading it into an AWS S3 bucket and assigning the correct privileges, so it can be used for machine learning
  • Creating a data source based on the just uploaded AWS S3 object and providing the input schema (e.g. which fields are category fields, which ones are numeric etc.), as this information can automatically be derived from DDIC information
  • Creating a model from the datasource
  • Training the model
  • Creating a URL endpoint so the model can be used for predictions as seen in the previous article.
That's quite a lot of stuff that you do not need to do anymore. Doing all this is just one API call away:

FORM create_model USING ir_aws_machine_learning TYPE REF TO /snp/aws00_cl_ml
                        it_table TYPE table
               CHANGING rv_model_id.

  rv_model_id = ir_aws_machine_learning->create_model(

    "...by creating a CSV file from an internal table
    "  and uploading it to AWS S3, so it can be used
    "  as a machine learning data source
    it_table = it_table

    "...by defining a target field that is used
    iv_target_field = 'RUNTIME'

    "...(optional) by defining a title
    iv_title = 'Model for SNP System Scan Runtimes'

    "...(optional) to create an endpoint, so the model
    "  can be used for predictions. This defaults to
    "  true, but you may want to switch it off

    " IV_CREATE_ENDPOINT = ABAP_FALSE

    "...(optional) by defining fields that should be
    "  treated as text rather than as a category.
    "  By default all character based fields are treated
    "  as categorical fields

    " IV_TEXT_FIELDS = 'COMMA,SEPARATED,LIST,OF,FIELDNAMES'

    "...(optional) by defining fields that should be
    "  treated as numerical fields rather than categorical
    "  fields. By default the type will be derived from the
    "  underlying data type, but for convenience reasons
    "  you may want to use this instead of creating and
    "  filling a completely new structure

    " IV_NUMERIC_FIELDS = 'COMMA,SEPARATED,LIST,OF,FIELDNAMES'

    "...(optional) by defining if you want to create the model
    "  synchronously or asynchronously. By default the
    "  datasource, model, evaluation and endpoint are created
    "  synchronously so that after returning from the method call
    "  you can immediately start with predictions.

    " IV_WAIT = ABAP_TRUE by default
    " IV_SHOW_PROGRESS = ABAP_TRUE by default
    " IV_REFRESH_RATE_IN_SECS = 5 seconds by default

  ).

ENDFORM.

As you can see, most parameters are optional. Sane default values are provided that assume the data upload and the creation of the datasource, model, training and endpoint all happen synchronously, so you can perform predictions directly afterwards. Creating all of this asynchronously is also possible, in case you do not need to perform predictions right away. After all, the whole process takes 10 to 15 minutes, which is why showing progress becomes important, especially since you do not want to run into timeouts when doing this in online mode with a GUI connected.
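
To give an idea of what two of the automated steps listed earlier involve (deriving the input schema and converting the internal table into CSV), here is a minimal sketch. It is not the implementation of /snp/aws00_cl_ml, which is not shown in this post. The report name, the record layout ty_scan and the use of runtime type information (RTTI) in place of a DDIC lookup are all assumptions made for illustration, and the CSV part does no quoting or escaping of separators.

```abap
REPORT z_ml_schema_sketch.

"Hypothetical record layout, for illustration only
TYPES: BEGIN OF ty_scan,
         db      TYPE c LENGTH 20,
         db_used TYPE i,
         runtime TYPE c LENGTH 20,
       END OF ty_scan.

DATA: lt_scan   TYPE STANDARD TABLE OF ty_scan WITH DEFAULT KEY,
      ls_scan   TYPE ty_scan,
      lo_struct TYPE REF TO cl_abap_structdescr,
      ls_comp   TYPE abap_compdescr,
      lv_kind   TYPE string,
      lv_value  TYPE string,
      lv_line   TYPE string.

FIELD-SYMBOLS: <lv_field> TYPE any.

START-OF-SELECTION.

  "one example row
  ls_scan-db      = 'ORACLE 12'.
  ls_scan-db_used = 5000.
  ls_scan-runtime = 'up to 2 hours'.
  APPEND ls_scan TO lt_scan.

  "derive the schema from the runtime type information of a row
  lo_struct ?= cl_abap_typedescr=>describe_by_data( ls_scan ).

  LOOP AT lo_struct->components INTO ls_comp.
    CASE ls_comp-type_kind.
      WHEN cl_abap_typedescr=>typekind_int
        OR cl_abap_typedescr=>typekind_packed
        OR cl_abap_typedescr=>typekind_float.
        lv_kind = 'NUMERIC'.
      WHEN OTHERS.
        lv_kind = 'CATEGORICAL'.
    ENDCASE.
    WRITE: / ls_comp-name, lv_kind.
  ENDLOOP.

  "build one comma separated line per row (no quoting or escaping)
  LOOP AT lt_scan INTO ls_scan.
    CLEAR lv_line.
    LOOP AT lo_struct->components INTO ls_comp.
      IF sy-tabix > 1.
        CONCATENATE lv_line ',' INTO lv_line.
      ENDIF.
      ASSIGN COMPONENT ls_comp-name OF STRUCTURE ls_scan TO <lv_field>.
      lv_value = <lv_field>.
      CONDENSE lv_value. "drops the sign blank of converted numbers
      CONCATENATE lv_line lv_value INTO lv_line.
    ENDLOOP.
    WRITE: / lv_line.
  ENDLOOP.
```

A real implementation would also have to write a header line, escape values that contain the separator and honor the text and numeric field overrides described above.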

The Result

After all is done, you can perform predictions. Right, let's just hop over into the AWS Machine Learning console and see the results:

A CSV file was created in an AWS S3 bucket...


...then a datasource, ML model and an evaluation for training the model were created (also an endpoint, but the screenshot does not show it) ...


...and finally we can inspect the model performance.

Conclusion

This is a big step towards making Machine Learning available to many without the explicit need to cope with vendor-specific aspects. However, understanding the principles of machine learning is still a requirement, especially which problems you can apply it to and what good data quality means for good predictions.

Machine Learning Recipes

A cool series for learning the principles of machine learning...














Sunday, January 01, 2017

Using AWS Machine Learning from ABAP to predict runtimes

Happy new year everybody!

Today I tried out Amazon's Machine Learning capabilities. After running through the basic AWS Machine Learning tutorial and getting to know how the guys at AWS deal with the subject, I got quite excited.



Everything sounds quite easy:

  1. Prepare example data in a single CSV file with good and distinct features for test and training purposes
  2. Create a data source from that CSV file, which basically means verifying that the column types were detected correctly and specifying a result column. 
  3. Create a Machine Learning model from the data source, running an evaluation on it
  4. Create an Endpoint, so your model becomes consumable via a URL based service

My example use case was to predict the runtime of one of our analysis tools - SNP System Scan - given some system parameters. In general any software will probably benefit from good runtime predictions as this is a good way to improve the user experience. We all know the infamous progress bar metaphor that quickly reaches 80% but then takes ages to get to 100%. As a human being I expect progress to be more... linear ;-)


So this seems like a perfect starting point for exploring Machine Learning. I got my data prepared and ran through all the above steps. My datasource contained numerical and categorical columns, but boolean and text columns are also available. Text is good for unstructured data such as natural language, but I did not get into that yet. Everything so far was quite easy and went well.

Now I needed to incorporate the results into the software, which is in ABAP. Hmmm, no SDK for ABAP. Figured! But I still want to enable all my colleagues to take advantage of this new buzzword technology and play around with it. I decided on a quick implementation using the proxy pattern.


So I have created an ABAP based API that calls a PHP based REST Service via HTTP, which then utilizes the PHP SDK for AWS to talk to the AWS Machine Learning Endpoint I previously created.
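
As a rough idea of what the ABAP side of such a proxy call can look like, here is a minimal sketch based on the standard class cl_http_client. It is not the actual implementation behind /snp/aws00_cl_ml. The report name, the service URL and the shape of the JSON payload are assumptions made for illustration, since the REST service itself is not shown in this post.

```abap
REPORT z_ml_proxy_call_sketch.

"Hypothetical URL of the PHP REST service, not taken from this post
CONSTANTS: gc_url TYPE string VALUE `https://example.com/ml/predict`.

DATA: lo_client   TYPE REF TO if_http_client,
      lv_request  TYPE string,
      lv_response TYPE string,
      lv_code     TYPE i,
      lv_reason   TYPE string.

START-OF-SELECTION.

  "hand-written JSON payload (assumed format, for illustration only)
  lv_request = `{"model":"ml-BtUpHOFhbQd","record":{"db":"ORACLE 12","db_used":"5000"}}`.

  "create an HTTP client for the REST service
  cl_http_client=>create_by_url(
    EXPORTING
      url    = gc_url
    IMPORTING
      client = lo_client
    EXCEPTIONS
      OTHERS = 1 ).
  IF sy-subrc <> 0.
    WRITE: / 'Could not create HTTP client'.
    RETURN.
  ENDIF.

  "send the record as a JSON body via POST
  lo_client->request->set_method( if_http_request=>co_request_method_post ).
  lo_client->request->set_content_type( 'application/json' ).
  lo_client->request->set_cdata( lv_request ).

  lo_client->send( EXCEPTIONS OTHERS = 1 ).
  IF sy-subrc = 0.
    lo_client->receive( EXCEPTIONS OTHERS = 1 ).
  ENDIF.
  IF sy-subrc <> 0.
    WRITE: / 'HTTP communication failed'.
    lo_client->close( ).
    RETURN.
  ENDIF.

  "read status and raw response body, then release the connection
  lo_client->response->get_status( IMPORTING code = lv_code reason = lv_reason ).
  lv_response = lo_client->response->get_cdata( ).
  lo_client->close( ).

  WRITE: / lv_code, lv_reason.
  WRITE: / lv_response.
```

In a generic API the payload would be serialized from the record structure and the response parsed back into the result variable, so that callers never have to deal with HTTP or JSON themselves.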

I wanted the ABAP part to be both as easy and as generic as possible, so the API should work with any ML model and any record structure. The way ABAP application developers would interact with this API looks like this:


REPORT  /snp/aws01_ml_predict_scan_rt.

PARAMETERS: p_comp TYPE string LOWER CASE OBLIGATORY DEFAULT 'SAP ECC 6.0'.
PARAMETERS: p_rel TYPE string LOWER CASE OBLIGATORY DEFAULT '731'.
PARAMETERS: p_os TYPE string LOWER CASE OBLIGATORY DEFAULT 'HP-UX'.
PARAMETERS: p_db TYPE string LOWER CASE OBLIGATORY DEFAULT 'ORACLE 12'.
PARAMETERS: p_db_gb TYPE i OBLIGATORY DEFAULT '5000'. "5 TB System
PARAMETERS: p_uc TYPE c AS CHECKBOX DEFAULT 'X'. "Is this a unicode system?
PARAMETERS: p_ind TYPE string LOWER CASE OBLIGATORY DEFAULT 'Retail'. "Industry
PARAMETERS: p_svers TYPE string LOWER CASE OBLIGATORY DEFAULT '16.01'. "Scan Version

START-OF-SELECTION.
  PERFORM main.

FORM main.
*"--- DATA DEFINITION -------------------------------------------------
  "Definition of the record, based on which a runtime prediction is to be made
  TYPES: BEGIN OF l_str_system,
          comp_version TYPE string,
          release TYPE string,
          os TYPE string,
          db TYPE string,
          db_used TYPE string,
          is_unicode TYPE c,
          company_industry1 TYPE string,
          scan_version TYPE string,
         END OF l_str_system.

  "AWS Machine Learning API Class
  DATA: lr_ml TYPE REF TO /snp/aws00_cl_ml.
  DATA: ls_system TYPE l_str_system.
  DATA: lv_runtime_in_mins TYPE i.
  DATA: lv_msg TYPE string.
  DATA: lr_ex TYPE REF TO cx_root.

*"--- PROCESSING LOGIC ------------------------------------------------
  TRY.
      CREATE OBJECT lr_ml.

      "set parameters
      ls_system-comp_version = p_comp.
      ls_system-release = p_rel.
      ls_system-os = p_os.
      ls_system-db = p_db.
      ls_system-db_used = p_db_gb.
      ls_system-is_unicode = p_uc.
      ls_system-company_industry1 = p_ind.
      ls_system-scan_version = p_svers.

      "execute prediction
      lr_ml->predict(
        EXPORTING
          iv_model   = 'ml-BtUpHOFhbQd' "model name previously trained in AWS
          is_record  = ls_system
        IMPORTING
          ev_result  = lv_runtime_in_mins
      ).

      "output results
      lv_msg = /snp/cn00_cl_string_utils=>text( iv_text = 'Estimated runtime of &1 minutes' iv_1 = lv_runtime_in_mins ).
      MESSAGE lv_msg TYPE 'S'.

    CATCH cx_root INTO lr_ex.

      "output errors
      lv_msg = lr_ex->get_text( ).
      PERFORM display_lines USING lv_msg.

  ENDTRY.

ENDFORM.

FORM display_lines USING iv_multiline_test.
*"--- DATA DEFINITION -------------------------------------------------
  DATA: lt_lines TYPE stringtab.
  DATA: lv_line TYPE string.

*"--- PROCESSING LOGIC ------------------------------------------------
  "split into multiple lines...
  SPLIT iv_multiline_test AT cl_abap_char_utilities=>newline INTO TABLE lt_lines.
  LOOP AT lt_lines INTO lv_line.
    WRITE: / lv_line. "...and output each line individually
  ENDLOOP.

ENDFORM.

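Now on the PHP side I simply used the AWS SDK for PHP. Setting it up is as easy as extracting a ZIP file, requiring the autoloader and just using the API. I wrote a little wrapper class that I could easily expose as a REST Service (not shown here).

<?php

//load the SDK autoloader shipped with the ZIP distribution
//(the path is an assumption, adjust it to where you extracted the ZIP)
require __DIR__ . '/aws/aws-autoloader.php';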

class SnpAwsMachineLearningApi {

   /**
   * Create an AWS ML Client Object
   */
   private function getClient($key,$secret) {
      return new Aws\MachineLearning\MachineLearningClient([
         'version' => 'latest',
         'region'  => 'us-east-1',
         'credentials' => [
            'key'    => $key,
            'secret' => $secret
         ],
      ]);
   }

   /**
   * Determine the URL of the Model Endpoint automatically
   */
   private function getEndpointUrl($model,$key,$secret) {

      //fetch metadata of the model
      $modelData = $this->getClient($key,$secret)->getMLModel([
         'MLModelId'=>$model,
         'Verbose'=>false
      ]);

      //check if model exists
      if(empty($modelData)) {
         throw new Exception("model ".$model." does not exist");
      }

      //getting the endpoint info
      $endpoint = $modelData['EndpointInfo'];

      //check if endpoint was created
      if(empty($endpoint)) {
         throw new Exception("no endpoint exists");
      }

      //check if endpoint is ready
      if($endpoint['EndpointStatus'] != 'READY') {
         throw new Exception("endpoint is not ready");
      }

      //return the endpoint url
      return $endpoint['EndpointUrl'];
   }

   /**
   * Execute a prediction
   */
   public function predict($model,$record,$key,$secret) {
      return $this->getClient($key,$secret)->predict(array(

          //provide the model name
         'MLModelId'       => $model,

         //make sure it's an associative array that is passed as the record
         'Record'          => json_decode(json_encode($record),true),

         //determine the URL of the endpoint automatically, assuming there is
         //one and exactly one
         'PredictEndpoint' => $this->getEndpointUrl($model,$key,$secret)
      ));
   }

}

And that is basically it. Of course, for the future it would be great to get rid of the PHP part and have a purely ABAP-based SDK implementation, but again, this was supposed to be a quick and easy implementation.

Currently it enables ABAP developers to execute predictions on the AWS Machine Learning platform on any trained model without having to leave their terrain.

In the future this could be extended to initially providing or updating datasources from ABAP internal tables, creating and training models on the fly and of course abstracting things so far that other Machine Learning providers can be plugged in. So why not explore the native SAP HANA capabilities next...