Matillion has the functionality to run Python scripts as part of a Matillion job. This is typically used to set variable values, run quick queries against the target database using database cursors, or connect to other AWS services. However, there are circumstances where the amount of processing a Python script requires makes it more sensible to run the script elsewhere, minimising the impact on the rest of the Matillion server. This may be because the script needs a large amount of processing power, which would affect the resources of the EC2 instance hosting the Matillion application, or because of corporate requirements around security or execution isolation.
Examples include processing large files, such as converting XML to JSON or zipping/unzipping them, and running API calls that send or return a lot of data. In this blog we will look at a simple Python script that converts an XML file to a JSON file so it can be loaded into the target database. This script, while simple, can require a lot of memory and processing power when converting large files.
Below are some options for running Python scripts outside of the Matillion instance using the Bash Script component.
It’s possible to run the script remotely on another EC2 instance by connecting to it with an SSH command. If required, the AWS CLI can even be used to start and stop the instance on demand, as sketched after the example. The SSH command connects to the instance using a key stored on the Matillion server, and the script is run using the “python” command. In this example, the Python script is passed through to the remote server using a “heredoc”. Below is an example script:
ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON
import xmltodict, json, boto3

#fetch the XML file from S3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mtln-laura-test', 'example.xml', '/tmp/example.xml')

#parse the XML and convert it to JSON
with open('/tmp/example.xml', 'r') as myfile:
    xml_data = myfile.read()
o = xmltodict.parse(xml_data)
open('/tmp/example.json', 'w').write(json.dumps(o))

#upload the converted file back to S3
s3.meta.client.upload_file('/tmp/example.json', 'mtln-laura-test', 'example.json')
print 'file converted successfully'
END_OF_PYTHON
The result of the Python script is printed in the Bash window.
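If the remote instance does not need to run permanently, the AWS CLI calls to start and stop it can wrap the SSH command. A minimal sketch, assuming a placeholder instance id:

#start the remote instance before running the script
aws ec2 start-instances --instance-ids i-0123456789abcdef0 --region eu-west-1
#wait until the instance reports as running (SSH may take a few more moments to accept connections)
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0 --region eu-west-1

#...run the SSH command shown above...

#stop the instance again once the script has finished
aws ec2 stop-instances --instance-ids i-0123456789abcdef0 --region eu-west-1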
Advantages:
Disadvantages:
AWS Lambda allows you to run code without worrying about servers, and you pay only for the compute time you use. Python 2.7, Python 3.6 and Python 3.7 are amongst the supported runtimes. To run a Python script in AWS Lambda, a Lambda function must first be created and can then be run whenever required. Functions can be created and run through the AWS web console. A simple example function is shown below:
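The console screenshot is not reproduced here, but a minimal “hello world” handler matching the index.handler setting used in the CLI steps below might look like this sketch:

import json

def handler(event, context):
    #a trivial handler; a real script would do its processing here
    print('hello world')
    return {'statusCode': 200, 'body': json.dumps('hello world')}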
The AWS CLI supports creating a Python function in Lambda and then running it. These CLI commands can be run from a Bash Script component within Matillion. The example script below assumes the xmltojson.py Python script exists on the Matillion server in the /tmp directory and walks through the steps required to create a Lambda function from it and run it.
Please note the below will redefine the function every time it runs. If the code does not change, the function can simply be run using the aws lambda invoke statement on its own:
#Lambda will run the handler index.handler, so the module must be called index.py; first rename the Python script
cp /tmp/xmltojson.py index.py
#zip the function up
zip function.zip index.py
#delete the function if it already exists
aws lambda delete-function --function-name helloworld --region eu-west-1
#create the function in Lambda
aws lambda create-function --function-name helloworld --zip-file fileb://function.zip --handler index.handler --runtime python2.7 --role arn:aws:iam::557249109363:role/lambda-cli-role --region eu-west-1
#run the function, writing its result to a file
aws lambda invoke --function-name helloworld --log-type Tail /tmp/outputfile.txt --region eu-west-1
#delete the function again
aws lambda delete-function --function-name helloworld --region eu-west-1
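If the function’s execution log is also wanted in the Bash window, the LogResult returned by --log-type Tail can be decoded inline; this is the standard AWS CLI pattern:

#decode the base64 execution log returned alongside the invoke response
aws lambda invoke --function-name helloworld --log-type Tail --query 'LogResult' --output text /tmp/outputfile.txt --region eu-west-1 | base64 --decode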
Note that the AWS credentials for the Environment used to run this job require the AWSLambdaBasicExecutionRole IAM policy.
Advantages:
Disadvantages:
AWS Glue can be used to create and run Python shell jobs. These are Python scripts run as a shell script, rather than the original Glue offering of PySpark only. Any script can be run, provided it is compatible with Python 2.7.
As with the Lambda function, an AWS Glue ETL job must first be created, as a Python shell job, and then executed. The AWS Glue job is created by linking to a Python script in S3, granting an IAM role for the script to run under, and selecting any available connections, such as to Redshift.
Again, the Glue job can be created either via the console or the AWS CLI. Below are the steps to create and run the job via the AWS CLI from a Bash Script component within Matillion. If the code doesn’t change, we recommend creating the job once and simply running it from Matillion:
#the script must exist on S3 for Glue
aws s3 cp /tmp/xmltojson.py s3://mtln-laura-ireland/xmltojson.py --region eu-west-1
#create the job (the command name for a Python shell job must be "pythonshell")
aws glue create-job --name python-job-cli --role jc-glue-role --command '{"Name": "pythonshell", "ScriptLocation": "s3://mtln-laura-ireland/xmltojson.py"}' --region eu-west-1
#capture the start-job-run output in a variable for use in the CloudWatch command
JOBOUTPUT=$(aws glue start-job-run --job-name python-job-cli --region eu-west-1)
#extract the job run id from the returned JSON
JOBID=$(echo $JOBOUTPUT | cut -d '"' -f 4)
echo $JOBID
#give the run time to start, then fetch its logs
sleep 30
aws logs get-log-events --log-group-name /aws-glue/python-jobs/output --log-stream-name $JOBID --region eu-west-1 > /tmp/output.txt
In AWS Glue, the output of a job run is written to CloudWatch. If the output is required in Matillion, it has to be written to a file on the Matillion server, as in the last line of the script above.
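The fixed sleep above is a simplification. A more robust sketch polls the run state with aws glue get-job-run before fetching the logs:

#poll until Glue reports the run has finished
STATE=$(aws glue get-job-run --job-name python-job-cli --run-id $JOBID --query 'JobRun.JobRunState' --output text --region eu-west-1)
while [ "$STATE" = "STARTING" ] || [ "$STATE" = "RUNNING" ]; do
  sleep 10
  STATE=$(aws glue get-job-run --job-name python-job-cli --run-id $JOBID --query 'JobRun.JobRunState' --output text --region eu-west-1)
done
echo "Job finished with state: $STATE"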
Note that to run this script, the Matillion environment requires IAM policies covering the AWS Glue, S3 and CloudWatch Logs actions used above.
It is often a requirement that part of the Python script is made dynamic. This is achieved using variables in Matillion. For example, in the SSH example discussed earlier, the S3 bucket and file names were hard coded, but it is unlikely that these file names will always stay the same.
It is also unrealistic to edit the Python script each time simply to update the file or bucket names. Instead, job variables can be set up in the Matillion job and referenced in the heredoc, which pushes the variable values to the remote server:
ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python" << END_OF_PYTHON
import xmltodict, json, boto3

#the bucket and file names now come from Matillion job variables,
#which Bash expands inside the heredoc before the script is sent
s3 = boto3.resource('s3')
s3.meta.client.download_file('$s3Bucket', '$xmlFile', '/tmp/$xmlFile')

with open('/tmp/$xmlFile', 'r') as myfile:
    xml_data = myfile.read()
o = xmltodict.parse(xml_data)
open('/tmp/$xmlFile.json', 'w').write(json.dumps(o))

s3.meta.client.upload_file('/tmp/$xmlFile.json', '$s3Bucket', '$xmlFile.json')
print 's3://$s3Bucket/$xmlFile.json'
END_OF_PYTHON
The script has also been edited to return the bucket and filename created.
The output from the script run can also be pushed to an output file, which can then be read by another Matillion component and passed into another variable value if required, as sketched below.
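For instance, assuming the conversion script already exists on the remote instance as /tmp/xmltojson.py (a hypothetical path), the value it prints can be captured and written to a file for a later component to read:

#capture whatever the remote script prints (the s3:// path in the example above)
OUTPUT=$(ssh -o StrictHostKeyChecking=no -i id_rsa ec2-user@ip-172-31-30-130.eu-west-1.compute.internal "python /tmp/xmltojson.py")
#write it to a file that a later Matillion component can read into a variable
echo "$OUTPUT" > /tmp/python_output.txt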
Some of the above options require the Python script to be available as a file on the Matillion server. However, the user may wish to write the script from within Matillion. Here we discuss options for doing this.
The natural place to write the Python code is the Python Script component, where you can take advantage of the colour coding of the code. This approach can also be used to convert Python scripts already written in Matillion into files to be run externally: a Bash Script component exports the job containing the Python script using the Matillion API, parses out the relevant Python and writes the result to a file in /tmp. It requires a number of inputs to make the API call and identify the correct script, as sketched below.
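As a rough illustration of that approach, the sketch below calls the v1 REST API export endpoint and pulls the script out of the export. The endpoint path, credentials and export layout shown here are assumptions to verify against your Matillion version; since the export layout is not fixed here, the snippet simply takes the longest string found under the named component, which in practice is the script body:

#inputs identifying the job and component (example values - all case sensitive)
GROUP="MyGroup"; PROJECT="MyProject"; VERSION="default"; JOB="MyJob"; COMPONENT="Python Script"
#export the job definition via the Matillion v1 REST API (endpoint path assumed)
curl -s -u api-user:api-password "http://localhost:8080/rest/v1/group/name/$GROUP/project/name/$PROJECT/version/name/$VERSION/job/name/$JOB/export" -o /tmp/job_export.json
#parse the Python out of the export and write it to /tmp
python << END_OF_PYTHON
import json

def strings_under(node, out):
    #collect every string value in a subtree of the export
    if isinstance(node, dict):
        node = node.values()
    if isinstance(node, list):
        for v in node:
            strings_under(v, out)
    elif isinstance(node, basestring):
        out.append(node)

def find_component(node, name):
    #find the dict that mentions the component name among its values
    if isinstance(node, dict):
        if name in [v for v in node.values() if isinstance(v, basestring)]:
            return node
        node = node.values()
    if isinstance(node, list):
        for v in node:
            match = find_component(v, name)
            if match is not None:
                return match
    return None

export = json.load(open('/tmp/job_export.json'))
component = find_component(export, '$COMPONENT')
if component is None:
    raise Exception('Python Script component "$COMPONENT" not found')
strings = []
strings_under(component, strings)
open('/tmp/exported_script.py', 'w').write(max(strings, key=len))
print 'created /tmp/exported_script.py'
END_OF_PYTHON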
When the script has run, it will either report that the file was created successfully, or throw an error explaining why the API call was unsuccessful, or throw an error stating that the name of the Python Script component was incorrect. Please note that all variables are case sensitive.
A different approach is to write the Python code directly into a Bash Script component and use this to generate a file. This can be done with a heredoc, for example:
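The following sketch writes the conversion script from earlier to a file on the Matillion server; quoting the heredoc delimiter stops Bash expanding any $ characters, so Matillion variables would need the unquoted form used previously:

cat << 'END_OF_PYTHON' > /tmp/xmltojson.py
import xmltodict, json, boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mtln-laura-test', 'example.xml', '/tmp/example.xml')
with open('/tmp/example.xml', 'r') as myfile:
    xml_data = myfile.read()
open('/tmp/example.json', 'w').write(json.dumps(xmltodict.parse(xml_data)))
s3.meta.client.upload_file('/tmp/example.json', 'mtln-laura-test', 'example.json')
END_OF_PYTHON
echo "/tmp/xmltojson.py created"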
The advantage of this approach is that it avoids the need for the API call and is faster for creating the file. It also means users aren’t tempted to “Run” the code from within Matillion.
We recommend, before developing Python scripts within Matillion, that you carefully consider what the script will do, and run any scripts which require considerable resources remotely. For a large XML file, the Python script above can take up to 10 minutes to run. Having this run on the Matillion server would reduce the CPU and memory available for running Matillion jobs and could also have an adverse effect on other Matillion users.
Using the remote options discussed ensures the script can be initiated from Matillion but run in isolation, and the result of the script can easily be fed back to Matillion. There are, however, some cases where it is best to run the script directly within Matillion, for example any Jython code.