Stata in DXJupyterLab

Using Stata via DXJupyterLab, working with project files, and creating datasets with Spark.

Stata, a statistical software package for data science, is supported through the Stata kernel in the DXJupyterLab app.

To get access to Stata, launch the DXJupyterLab app with the feature input set to STATA. The app requires valid Stata license information in a DNAnexus file specified by the stataSettings app input, which defaults to the /.stataSettings.user-<username>.json file in the project where the app is run.

Once the DXJupyterLab app is running, you can create and edit Stata notebooks directly in the DNAnexus project by following the DXJupyterLab guide and selecting Stata Kernel from the kernel menu. You can also create a Stata notebook in the DXJupyterLab container by clicking the Stata icon in the Launcher tab.

Figure: Selecting Stata Kernel for a new DNAnexus-hosted notebook

Figure: Getting help, and using the Stata shell ("!") command inside a Stata notebook

Figure: Stata work in action

License Information File Format

The license information file can use either of the following formats:

  • Direct format:

    {
      "license": {
        "serialNumber": "<Serial number from Stata>",
        "code": "<Code from Stata>",
        "authorization": "<Authorization from Stata>",
        "user": "<Registered user line 1>",
        "organization": "<Registered user line 2>"
      }
    }
  • A more secure indirect format. The $dnanexus_link references a file (file-xxxx) in a private project (project-yyyy); that file contains the license information in direct format.

    {
      "licenseFile": {
        "$dnanexus_link": {
          "id": "file-xxxx",
          "project": "project-yyyy"
        }
      }
    }
    Use the indirect format if you are working in a shared project and don't want to expose your Stata license information to other members of that project. In this situation, create a private project, store the license information in direct format in a file in the private project, then create a /.stataSettings.user-<username>.json file in indirect format in the shared project that references the private project file. See an example of this below:

    $ dx new project mycredentials --brief -s
    project-12345
    # don't share mycredentials project with anybody.
    $ echo '{"license": {"serialNumber": "XXXXXXXXXXXX", "code": "XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX", "authorization": "XXXX", "user":"'$(dx whoami)'", "organization":"dnanexus-user"}}' \
    | dx upload --destination privateStataSettings.json --brief -
    file-abcd
    $ dx select shared_project
    $ echo '{"licenseFile":{"$dnanexus_link":{"id":"file-abcd","project":"project-12345"}}}' \
    | dx upload --destination .stataSettings.user-$(dx whoami).json --brief -

Note that in the UK Biobank Research Analysis Platform, the private credentials project has to be created from the Research Analysis Platform Projects page.
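
The two file formats above can also be generated and checked programmatically before uploading. The following is a minimal Python sketch, not part of the DXJupyterLab app: the function names `make_indirect_settings` and `settings_format` are hypothetical, and treating all five fields of the direct format as required is an assumption based only on the example above.

```python
import json

# Fields shown inside "license" in the direct-format example above.
# Assumption: all five are required; the source does not state this explicitly.
DIRECT_KEYS = {"serialNumber", "code", "authorization", "user", "organization"}


def make_indirect_settings(file_id, project_id):
    """Build indirect-format settings that point at a direct-format file."""
    return {"licenseFile": {"$dnanexus_link": {"id": file_id, "project": project_id}}}


def settings_format(settings):
    """Return "direct" or "indirect"; raise ValueError if neither shape matches."""
    if "license" in settings:
        missing = DIRECT_KEYS - set(settings["license"])
        if missing:
            raise ValueError(f"direct format is missing keys: {sorted(missing)}")
        return "direct"
    if "licenseFile" in settings:
        link = settings["licenseFile"].get("$dnanexus_link", {})
        if not {"id", "project"} <= set(link):
            raise ValueError("indirect format needs $dnanexus_link with id and project")
        return "indirect"
    raise ValueError("expected a top-level 'license' or 'licenseFile' key")


# Placeholder IDs taken from the shell example above, not real objects.
indirect = make_indirect_settings("file-abcd", "project-12345")
print(json.dumps(indirect, indent=2))
print(settings_format(indirect))
```

The JSON printed by the sketch has the same shape as the indirect-format file uploaded to the shared project in the shell example.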

Working with Project Files

You can download DNAnexus data files to the DXJupyterLab container from a Stata notebook with:

!dx download project-xxxx:file-yyy

Data files in the current project can also be accessed from a Stata notebook through the /mnt/project folder.

To load a DTA file:

use /mnt/project/<path>/data_in.dta

To load a CSV file:

import delimited /mnt/project/<path>/data_in.csv

To write a DTA file to the DXJupyterLab container:

save data_out

To write a CSV file to the DXJupyterLab container:

export delimited data_out.csv

To upload a data file from the DXJupyterLab container to the project, use the following command in a Stata notebook:

!dx upload <file> --destination=<destination>

Alternatively, open a new Launcher tab, open Terminal, and run:

dx upload <file> --destination=<destination>

Note that the /mnt/project directory is read-only, so trying to write to it results in an error.

Creating a Stata Dataset with Spark

The DXJupyterLab Spark cluster app can be used to query and filter DNAnexus datasets, returning a PySpark DataFrame. A PySpark DataFrame can be converted to a pandas DataFrame with:

pandas_df = spark_df.toPandas()

The pandas DataFrame can be exported to CSV or Stata DTA files in the JupyterLab container with:

pandas_df.to_stata("data_out.dta")
pandas_df.to_csv("data_out.csv")
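
Stata variable names are limited to 32 characters and may contain only letters, digits, and underscores, with a letter or underscore first, so column names coming from a query may need cleaning before export. The following sketch is one possible way to do that; `stata_safe_names` is a hypothetical helper, the sample column names are made up for illustration, and Stata reserved words are not handled.

```python
import re


def stata_safe_names(columns):
    """Map each column name to a unique name that is legal as a Stata variable."""
    seen = set()
    mapping = {}
    for col in columns:
        # Replace anything other than ASCII letters, digits, and underscores.
        name = re.sub(r"[^A-Za-z0-9_]", "_", str(col))
        # A variable name may not be empty or start with a digit.
        if not name or name[0].isdigit():
            name = "_" + name
        # Enforce the 32-character limit.
        name = name[:32]
        # Resolve collisions by replacing the tail with a counter.
        base, n = name, 1
        while name in seen:
            suffix = f"_{n}"
            name = base[: 32 - len(suffix)] + suffix
            n += 1
        seen.add(name)
        mapping[col] = name
    return mapping


# Hypothetical column names, for illustration only.
print(stata_safe_names(["participant.eid", "participant.p31", "2nd visit"]))
```

The returned mapping can be applied with `pandas_df.rename(columns=mapping)` before calling `to_stata`.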

To upload a data file from the JupyterLab container to the DNAnexus project in the DXJupyterLab Spark cluster app, use:

%%bash
dx upload <file>
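
The same upload can be driven from a Python cell instead of a %%bash cell. This is a rough sketch that shells out to the dx command-line client: `dx_upload_command` and `dx_upload` are hypothetical helper names, the destination path is a made-up example, and only the `--destination` and `--brief` options already used elsewhere on this page are assumed.

```python
import subprocess


def dx_upload_command(path, destination=None):
    """Build the argument list for `dx upload`; nothing is executed here."""
    cmd = ["dx", "upload", path, "--brief"]
    if destination is not None:
        cmd += ["--destination", destination]
    return cmd


def dx_upload(path, destination=None):
    """Run `dx upload` and return the text printed by --brief (the new file ID)."""
    result = subprocess.run(
        dx_upload_command(path, destination),
        check=True,           # raise CalledProcessError if dx exits with an error
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


# Show the command that would run; the destination is a hypothetical example.
print(dx_upload_command("data_out.dta", "/stata/data_out.dta"))
```

Calling `dx_upload("data_out.dta")` inside a running session would perform the actual upload.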

Once saved to the project, data files can be used in a DXJupyterLab Stata session using the instructions above.