Files

A file object can be used to store an opaque array of bytes (i.e. what is traditionally known as a "file"). File objects contain binary data, and are immutable. After a file has been uploaded, its contents cannot be modified.

Lifecycle

File objects are stateful. The following diagram represents the possible states (boxes) and actions. When a new file object is made (by calling new), it is initially empty and its state is "open". In that state, file contents can be uploaded in multiple parts (by calling upload), until a request is made to finalize the file (by calling close). File object finalization is not instantaneous, hence the file object advances to the "closing" state and remains in that state for as long as it is needed. In that state, contents may not be uploaded or downloaded, until the system has finalized the file. Once finalization is done by the system, the file object will advance to the "closed" state. In that state, file contents can be downloaded (by calling download). Files that are in the "open" or "closing" state for too long without any activity will be considered abandoned and deleted after some time. The user will receive a notification of such stale files after 24 hours, and after a few days the files will be deleted.
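
The lifecycle above can be read as a small state machine. The following Python sketch is illustrative only and is not part of the DNAnexus API: the names TRANSITIONS, TRANSFER_CALLS, next_state, and can_transfer are hypothetical, and "finalized" is an assumed label for the moment the system completes finalization.

```python
# Hypothetical model of the lifecycle described above; not a DNAnexus client.

# State changes: the user's close call, then the system finishing finalization.
TRANSITIONS = {
    ("open", "close"): "closing",
    ("closing", "finalized"): "closed",  # "finalized" is an illustrative event name
}

# Content transfer calls permitted in each state.
TRANSFER_CALLS = {
    "open": {"upload"},
    "closing": set(),
    "closed": {"download"},
}


def next_state(state: str, event: str) -> str:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"{event!r} does not change state {state!r}")
    return TRANSITIONS[key]


def can_transfer(state: str, call: str) -> bool:
    """Report whether `call` ('upload' or 'download') is allowed in `state`."""
    return call in TRANSFER_CALLS.get(state, set())
```

For example, can_transfer("closing", "upload") is False, which mirrors the rule that contents may be neither uploaded nor downloaded while the file is being finalized.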

Uploading

Genomic datasets are very large, and transferring a very large file in a single HTTP call is fragile. For this reason, uploading files in multiple, smaller parts is the standard way of introducing files into the DNAnexus platform: it makes uploads robust, resumable, and parallelizable. To encourage efficient upload practices, the system by default limits each part to between 5 MiB and 5 GiB (see "Limits on Parts").

The upload call takes several arguments that indicate which part is to be uploaded and other information specific to that part. The server returns a preauthenticated upload URL specific to that file object and part index, along with several headers that the client must provide with the subsequent HTTP PUT. The user can then upload the part to that URL by doing an HTTP PUT with the content of the part (such as when using "curl -T -X PUT"), along with the headers returned to the client, without providing any other special authentication headers. Users are allowed to upload the same part multiple times (by performing both an upload and matching PUT for a part more than once), but only the last successful PUT will be considered canonical.

The close call will perform finalization, effectively "concatenating" the parts so that once the file is closed, the part distinction is no longer there, and the original file can be subsequently downloaded using the download call. Parts are concatenated in order of ascending part index. Indices do not need to be consecutive.
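
As a local illustration of the ordering rule, the sketch below joins in-memory parts by ascending index. It only models the behavior described above; concatenate_parts is a hypothetical helper, and the real concatenation happens on the server during finalization.

```python
from typing import Dict


def concatenate_parts(parts: Dict[int, bytes]) -> bytes:
    """Join part payloads in ascending index order; gaps in the indices are ignored."""
    return b"".join(parts[index] for index in sorted(parts))


# Indices 1, 5, and 20 are not consecutive, yet the result is well defined.
assembled = concatenate_parts({20: b"C", 1: b"A", 5: b"B"})
```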

Closing a file object is only possible if all parts have been uploaded, i.e. if for every index supplied in any upload call so far, the user has successfully performed a PUT to the respective URL received. Closing will not conclude until all parts have been successfully uploaded. If the user does not complete a part upload for any file part previously created through an upload call, the close call will succeed but the file will remain stuck in "closing" until enough time passes for it to be considered abandoned, and then deleted. Therefore it is imperative that close only be called after all parts have been successfully uploaded.

When all parts have been successfully uploaded, the closing process will usually take on the order of a few seconds to minutes, depending on the size of the file. In rare cases it can take much longer.

The user can query the status of the file object by using the describe call.
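
One way to wait for finalization is to poll the describe call until the state becomes "closed". The sketch below assumes a caller-supplied describe function that takes a file ID and returns the parsed describe output as a mapping with a "state" key; wait_until_closed and its parameters are hypothetical names, and the timeout and interval values are arbitrary.

```python
import time
from typing import Any, Callable, Mapping


def wait_until_closed(
    describe: Callable[[str], Mapping[str, Any]],
    file_id: str,
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `describe` until the file is closed; return False if the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        state = describe(file_id)["state"]
        if state == "closed":
            return True
        if state != "closing":
            # An "open" file has not been closed yet, so waiting would never end.
            raise RuntimeError(f"unexpected state {state!r} for {file_id}")
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing sleep and clock as parameters keeps the loop testable without real delays.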

Limits on Parts

There are certain limits on part sizes and numbers. These limits are given by the fileUploadParameters field of the /project-xxxx/describe output of the project (or container) that contains the file:

  • Parts have a maximum size, in bytes

  • Parts may have a minimum size, in bytes

  • The completed file has a maximum size, in bytes

  • There is a maximum number of parts that may be uploaded

  • There may be a minimum number of parts that may be uploaded

See the documentation of /project-xxxx/describe for further details about how to interpret it; the client should call this route before beginning the upload to obtain the appropriate limits and break the file into appropriately sized chunks.

For reference, the default parameters (for projects whose region begins with "aws:") are the following:

  • maximumPartSize: 5368709120 (5 GiB)

  • minimumPartSize: 5242880 (5 MiB)

  • maximumFileSize: 5497558138880 (5 TiB)

  • maximumNumParts: 10000

  • emptyLastPartAllowed: true
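
A client might turn these limits into a part layout as sketched below. This is a minimal sketch, not the platform's algorithm: plan_parts and DEFAULT_AWS_PARAMS are hypothetical names, the defaults are copied from the list above, and a real client should use the values returned by /project-xxxx/describe. The optional minimum number of parts is not modeled.

```python
from typing import Any, List, Mapping, Tuple

# Defaults quoted above for projects whose region begins with "aws:".
DEFAULT_AWS_PARAMS = {
    "maximumPartSize": 5368709120,
    "minimumPartSize": 5242880,
    "maximumFileSize": 5497558138880,
    "maximumNumParts": 10000,
    "emptyLastPartAllowed": True,
}


def plan_parts(
    file_size: int, params: Mapping[str, Any] = DEFAULT_AWS_PARAMS
) -> List[Tuple[int, int, int]]:
    """Return (index, offset, length) for each part of a file of `file_size` bytes."""
    if file_size < 0 or file_size > params["maximumFileSize"]:
        raise ValueError("file size is outside the allowed range")
    if file_size == 0:
        # Assumption: an empty file is sent as a single empty (last) part.
        if not params["emptyLastPartAllowed"]:
            raise ValueError("an empty last part is not allowed")
        return [(1, 0, 0)]
    # Smallest part size that respects both the minimum size and the part count limit.
    per_part = -(-file_size // params["maximumNumParts"])  # ceiling division
    part_size = max(params["minimumPartSize"], per_part, 1)
    if part_size > params["maximumPartSize"]:
        raise ValueError("file cannot be split within the part size and count limits")
    return [
        (index, offset, min(part_size, file_size - offset))
        for index, offset in enumerate(range(0, file_size, part_size), start=1)
    ]
```

Only the last part may come out smaller than minimumPartSize, which matches the rule that the minimum applies to every part except the one with the highest index.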

Downloading

The download call returns a preauthenticated URL which can be used to download the file via a simple HTTP GET. The service behind that URL supports the "Range" header of the HTTP standard, allowing for any byte range to be downloaded, and enabling compatibility with download accelerators that fetch multiple ranges in parallel to increase throughput.
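
To fetch a file in parallel ranges, a client needs one "Range" header value per chunk. The sketch below computes those values for a known file size; range_headers is a hypothetical helper, and the chunk size is the caller's choice. Byte ranges in HTTP are inclusive at both ends.

```python
from typing import List


def range_headers(total_size: int, chunk_size: int) -> List[str]:
    """Return Range header values that cover bytes 0..total_size-1 in order."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    return [
        f"bytes={start}-{min(start + chunk_size, total_size) - 1}"
        for start in range(0, total_size, chunk_size)
    ]


# A 10-byte file fetched in 4-byte chunks.
example = range_headers(10, 4)  # ["bytes=0-3", "bytes=4-7", "bytes=8-9"]
```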

Removal From a Project

If a file object is removed from the project before it is closed, then in addition to removing this file object from the system, the operation results in the following actions:

  • Any previously generated upload or download URLs are invalidated.

  • Existing connections to previously generated upload or download URLs may close or fail with a 500 code (the exact behavior is undefined).

ACCOUNTING NOTE: A file increases byte usage by its size. Byte usage is counted upon upload completion.

File API Method Specifications

API method: /file/new

Specification

Creates a new file object. The file is initially in the "open" state. This call can optionally receive an Internet Media Type to associate with that file object. (DNAnexus uses this solely for the purpose of supplying the "Content-Type:" HTTP header when responding to download requests of the file. The Internet Media Type is used by web browsers to identify the kind of data stored in a file, and aid them in deciding what to do with the file when fetching it in their context). All values are accepted without further validation (and sent back as-is in the "Content-Type:" header when a file is downloaded), so long as they contain only characters in the ASCII range 33-126. If the "media" field is not provided, or is set to "", then the system will attempt to auto-detect the Internet Media Type.

Inputs

  • project string ID of the project or container to which the file should belong (e.g. the string "project-xxxx")

  • name string (optional, default is the new ID) The name of the object

  • tags array of strings (optional) Tags to associate with the object

  • types array of strings (optional) Types to associate with the object

  • hidden boolean (optional, default false) Whether the object should be hidden

  • properties mapping (optional) Properties to associate with the object

    • key Property name

    • value string Property value

  • details mapping or array (optional, default { }) JSON object or array that is to be associated with the object; see the Object Details section for details on valid input

  • folder string (optional, default "/") Full path of the folder that is to contain the new object

  • parents boolean (optional, default false) Whether all folders in the path provided in folder should be created if they do not exist

  • media string (optional, default "") The Internet Media Type (formerly known as MIME type or Content-type) of the file

  • nonce string (optional) Unique identifier for this request. Ensures that even if multiple requests fail and are retried, only a single file is created. For more information, see Nonces.

Outputs

  • id string ID of the created file object (i.e. a string in the form "file-xxxx")

Errors

  • InvalidInput

    • A reserved linking string ("$dnanexus_link") appears as a key in a hash in details but is not the only key in the hash

    • A reserved linking string ("$dnanexus_link") appears as the only key in a hash in details but has value other than a string

    • The key "media" (if provided) contains at least one character outside of the ASCII range 33-126

    • For each property key-value pair, the size, encoded in UTF-8, of the property key may not exceed 100 bytes and the property value may not exceed 700 bytes

    • A nonce was reused in a request but some of the other inputs had changed signifying a new and different request

    • A nonce may not exceed 128 bytes

  • PermissionDenied

    • UPLOAD access required

    • File creation restricted to job context in externalUploadRestricted project

    • Project's defaultSymlink drive is not accessible to perform this action

    • Action failed because CreateMultiPartUpload is not available for this drive

  • InvalidType

    • project is not a project ID

  • ResourceNotFound

    • The specified project is not found

    • The path in folder does not exist, and parents is false
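
Several of the InvalidInput conditions above can be checked on the client before the request is sent. The sketch below covers only the media, properties, and nonce rules; check_file_new_input is a hypothetical helper, and it assumes the nonce limit is measured in UTF-8 bytes, which the text does not state.

```python
from typing import Any, List, Mapping


def check_file_new_input(payload: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in a /file/new input mapping."""
    problems = []
    media = payload.get("media", "")
    if any(not 33 <= ord(ch) <= 126 for ch in media):
        problems.append("media contains a character outside ASCII 33-126")
    for key, value in payload.get("properties", {}).items():
        if len(key.encode("utf-8")) > 100:
            problems.append(f"property key {key[:20]!r}... exceeds 100 bytes")
        if len(value.encode("utf-8")) > 700:
            problems.append(f"value of property {key[:20]!r} exceeds 700 bytes")
    nonce = payload.get("nonce")
    # Assumption: the 128-byte nonce limit applies to the UTF-8 encoding.
    if nonce is not None and len(nonce.encode("utf-8")) > 128:
        problems.append("nonce exceeds 128 bytes")
    return problems
```

Note that a value such as "text/plain; charset=utf-8" fails the media check because the space character (ASCII 32) falls outside the accepted range.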

API method: /file-xxxx/upload

Specification

Informs the system that a file part (identified by a particular index) needs to be uploaded, and retrieves a "part upload URL" (specific to this part) for performing the upload of that part. This method needs to be called at least once during the file object lifecycle. Once this method is called for a particular index, then data for that part must be provided to the corresponding part upload URL before calling the "close" method.

The part upload URL returned by this method may refer to a different endpoint than the DNAnexus API server, and accepts HTTP PUT requests supplying the binary data for the file part. Any PUT request to the part upload URL must be initiated shortly after its generation, or else a new URL for the part must be generated with another call to upload. The PUT request MUST include all HTTP headers that are specified in the API server's response to upload (see below). A "Content-Type" header should not be supplied, since the Internet Media Type is not set separately for each part.

The part upload URL has support for CORS with the following configuration:

  • SSL is required (from an origin served over https)

  • Part uploads must use the HTTP PUT method

  • Allowed HTTP headers

    • content-length

    • origin

    • content-md5

    • accept

    • content-type

    • x-amz-server-side-encryption

If the request to a part upload URL completes successfully, an HTTP response with a response code in the 2xx range will be returned, with a blank response body. If the upload is unsuccessful, an HTTP response with an error response code will be returned.

This method may be called multiple times with the same index parameter. The system maintains a state for each part, which can be either "pending" or "complete". The first time this method is called for an index, the state of that part is set to "pending". If the PUT request to the part upload URL completes successfully, and in the meantime no other request has been made to that part upload URL, then the state is eventually set to "complete". However, users may call upload for the same part index more than once (to re-upload a part), in which case each subsequent upload call resets the state to "pending". If multiple overlapping requests are made to that part upload URL, the last successful request is considered the canonical one, hence the part will be pending or complete based on the fate of that last request.

All parts, except the part with the highest index, have a minimum size given by the fileUploadParameters.minimumPartSize field of the /project-xxxx/describe output. If the fileUploadParameters.emptyLastPartAllowed field of the /project-xxxx/describe has the value false, then the last part must contain at least 1 byte.

All parts have a maximum size given by the fileUploadParameters.maximumPartSize field of the /project-xxxx/describe output.

Inputs

  • size int The size in bytes of this file part

  • md5 string Hex encoding of the file part's MD5 message-digest

  • index int (optional, default 1) Number that determines the relative ordering of parts during the concatenation process that occurs in close. This must be at least 1, and at most the value fileUploadParameters.maximumNumParts returned by /project-xxxx/describe.

Outputs

  • url string A URL (of the https scheme) to which data may be sent via HTTP PUT

  • expires timestamp Time at which url will expire; this will be a couple of minutes in the future

  • headers mapping HTTP headers which must be supplied with any PUT request to url

    • key Header field name

    • value string Header value

    • Security note: the headers may include authentication tokens, and therefore should not be stored, logged, printed to console, etc. in production applications
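
The inputs and outputs above suggest a two-step client flow: compute size and md5 for the part, call upload, then PUT the bytes with the returned headers. The sketch below prepares both steps without sending anything; upload_call_input and part_put_request are hypothetical helpers, and the choice of HTTP client is left to the reader.

```python
import hashlib
from typing import Any, Dict, Mapping


def upload_call_input(data: bytes, index: int = 1) -> Dict[str, Any]:
    """Build the input mapping for /file-xxxx/upload for one part."""
    return {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),  # hex encoding of the part's MD5
        "index": index,
    }


def part_put_request(upload_output: Mapping[str, Any], data: bytes) -> Dict[str, Any]:
    """Describe the PUT that must follow, using only the headers the server returned."""
    return {
        "method": "PUT",
        "url": upload_output["url"],
        # Pass the returned headers through unchanged; avoid logging them.
        "headers": dict(upload_output["headers"]),
        "body": data,
    }
```

Some HTTP clients add a default Content-Type header to requests that carry a body (Python's urllib.request is one), so it is worth checking what the chosen client actually sends.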

Errors

  • PermissionDenied

    • UPLOAD access required

    • File upload restricted to job context in externalUploadRestricted project

  • InvalidInput

    • size must be a non-negative integer, no greater than fileUploadParameters.maximumPartSize

    • If fileUploadParameters.emptyLastPartAllowed is false, size must be at least min(fileUploadParameters.minimumPartSize, 1)

    • md5 must be a hex string of the appropriate length

    • index (if provided) must be a positive integer, no greater than fileUploadParameters.maximumNumParts

  • InvalidState

    • The file object is not in the open state

API method: /file-xxxx/describe

Specification

Describes a file object (see also /record-xxxx/describe). Returns, among other fields, the Internet Media Type of the file and the state of the file object. If the file object is in the "closed" state, the file size is reported as well. If the "parts" key in the input is true, or the file object is in the "open" state, the response contains a "parts" key whose value is a map describing the status of the parts that the system knows about. More specifically, for every part that the system has been informed of via an "upload" call, the "parts" map contains a key corresponding to the part index (represented as a string), whose value is a map with the part status: the state, size, and md5 of the part. The state can be either "pending" or "complete".

Alternatively, you can use the /system/describeDataObjects method to describe a large number of data objects at once.

As mentioned in the description of the "upload" call, a part can be in the "pending" state for any of the following reasons:

  • A PUT to its part upload URL has not been successfully completed.

  • An earlier PUT to its part upload URL has been successfully completed, but the request initiated last is either still ongoing or failed.

A part will be in the "complete" state when a PUT to its part upload URL has been successfully completed. In that case, the amount of data received in that request is shown in the "size" field, and MD5 hash of the data received is shown in the "md5" field (which are otherwise set to null, when the part is in "pending" state).

A project ID can be given to request user-provided metadata from a particular project and will be treated as a hint, i.e. if the specified project does not contain the object and another project is found which does contain it and for which the user has VIEW permissions, this other project is used instead to return the metadata. The project ID of the project used to return the user-provided metadata is always returned, regardless of whether it was the same as the hint provided. Details can also be requested via this method, but if the requestor does not have VIEW access, they will not be returned.

Files that are dispensed by a third-party data provider may be watermarked. The content of a watermarked file is determined by the id of the file, and by the watermarkId and watermarkVersion associated with the file in a specific project. The third-party data provider may update the watermark to a new version from time to time, thus changing the content of the watermarked file associated with the changed watermark.

Inputs

  • project string (optional) Project or container ID to be used as a hint for finding the object in an accessible project. This field should be provided to get consistent output for watermarked files.

  • defaultFields boolean (optional, default false if fields is supplied, true otherwise) whether to include the default set of fields in the output (the default fields are described in the "Outputs" section below). The selections are overridden by any fields explicitly named in fields.

  • fields mapping (optional) include or exclude the specified fields from the output. These selections override the settings in defaultFields.

    • key Desired output field; see the "Outputs" section below for valid values here

    • value boolean whether to include the field

The following options are deprecated (and will not be respected if fields is present):

  • parts boolean (optional, default true if file is in the "open" state and false otherwise) Whether additional information for each part should be returned

  • properties boolean (optional, default false) Whether the properties should be returned

  • details boolean (optional, default false) Whether the details should also be returned

Outputs

  • id string The object ID (i.e. the string "file-xxxx")

The following fields are included by default (but can be disabled using fields or defaultFields):

  • project string ID of the project or container in which the object was found

  • class string The value "file"

  • types array of strings Types associated with the object

  • created timestamp Time at which this object was created

  • state string The value "open", "closing", or "closed"

  • hidden boolean Whether the object is hidden or not

  • links array of strings The object IDs that are pointed to from this object

  • name string The name of the object

  • folder string The full path to the folder containing the object

  • sponsored boolean Whether the object is sponsored by DNAnexus

  • tags array of strings Tags associated with the object

  • modified timestamp Time at which the user-provided metadata of the object was last modified

  • media string The Internet Media Type of the file

  • archivalState string The archival state of the file

  • createdBy mapping How the object was created

    • user string ID of the user who created the object or launched an execution which created the object

    • job string (present if a job created the object) ID of the job that created the object

    • executable string (present if a job created the object) ID of the app or applet that the job was running

  • drive string The drive ID that the file is located in

  • symlinkPath mapping Remote path of the symlink

    • container string The container name (i.e. region:bucket for AWS S3 and containerName for Azure Blob)

    • object string The remote path of the symlink

  • md5 string Hex encoding of the whole file part's MD5 message-digest (Note this is only for readable symlink files)

The following field is included by default if the file is open:

  • parts mapping Information on the file parts that have been or are being uploaded

    • key Part index that has been provided to any /file-xxxx/upload calls on the file so far

    • value mapping Information on the file part with key/values:

      • state string Either "pending" or "complete"

      • size int or null The size of the part (in bytes) if state is "complete"; null otherwise

      • md5 string or null The hexadecimal encoded value of MD5 message-digest (as defined in RFC 1321) of the data if state is "complete"; null otherwise
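
A client can use the "parts" mapping to find the parts that still need a successful PUT. The sketch below assumes the describe output has already been parsed into a Python mapping; pending_part_indices is a hypothetical helper. Because the part indices arrive as string keys, they are converted to integers before sorting.

```python
from typing import Any, List, Mapping


def pending_part_indices(describe_output: Mapping[str, Any]) -> List[int]:
    """Return the indices of parts that are not yet "complete", in ascending order."""
    parts = describe_output.get("parts", {})
    return sorted(
        int(index) for index, info in parts.items() if info["state"] != "complete"
    )
```

Sorting the raw string keys would place "10" before "2", which is why the conversion happens first.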

The following field (included by default) is only available if the object is in the "closed" state:

  • size int Size of the file in bytes

The following field (included by default) is available if the object is sponsored by a third party:

  • sponsoredUntil timestamp Indicates the expiration time of data sponsorship (this field is only set if the object is currently sponsored, and if set, the specified time is always in the future)

The following fields are only returned if the corresponding field in the fields input is set to true:

  • properties mapping Properties associated with the object

    • key Property name

    • value string Property value

  • details mapping or array Contents of the object’s details

  • watermarkId string ID of the watermark applied to the file's content during download

  • watermarkVersion string version of the watermark's content applied to the file's content during download

  • resolvedPolicies mapping A mapping of policies affecting file-xxxx within the scope of a single project. Note that project must be specified in the input to receive consistent results. Also note that policies may be updated by your data provider at any time. Fields in this mapping will be:

    • isExternalDownloadable boolean True if file-xxxx is able to be downloaded, false otherwise.

Errors

  • ResourceNotFound

    • project, if specified, does not exist

  • PermissionDenied

    • VIEW access required to some project that contains the file object

    • If project is specified, VIEW access is required to that project

API method: /file-xxxx/close

Specification

Initiates finalization of the file object, if it is not already in the "closed" state.

To close a file object, there must be at least one part, and all of the parts must be in the "complete" state. If this call is successful, it will return immediately and the file object will advance to the "closing" state. The system will "concatenate" the parts, in order of increasing part index (and those indices do not have to be consecutive). Later, when the system is done, the file object will advance to the "closed" state. For a more detailed discussion please refer to the section "Uploading".

All parts, except the part with the highest index, have a minimum size given by the fileUploadParameters.minimumPartSize field of the /project-xxxx/describe output.

The part with the highest index must contain at least one byte if fileUploadParameters.emptyLastPartAllowed is false.

The total file size cannot exceed the size given by the fileUploadParameters.maximumFileSize field of the /project-xxxx/describe output.

If fileUploadParameters.emptyLastPartAllowed is true, there must be at least one part.

If you call this method on a file in the "closed" state, the call will have no effect. The call will report success and the detail field will be set as shown in "Outputs" below.

Inputs

None

Outputs

  • id string ID of the manipulated object (i.e. the string "file-xxxx")

If the object is in the closed state:

  • detail string String containing an explanatory message

Errors

  • PermissionDenied

    • UPLOAD access required

    • File closing restricted to job context in externalUploadRestricted project

  • InvalidState

    • fileUploadParameters.emptyLastPartAllowed is true and there are zero parts

    • At least one part is in the "pending" state

    • There exists a part, other than the one with the highest part index, whose size is less than fileUploadParameters.minimumPartSize bytes

    • fileUploadParameters.emptyLastPartAllowed is false and the part with the highest index has 0 bytes

    • The file has size larger than fileUploadParameters.maximumFileSize bytes
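
The InvalidState conditions above can be approximated on the client before calling close. The sketch below takes the "parts" mapping from /file-xxxx/describe and the fileUploadParameters mapping from /project-xxxx/describe; close_problems is a hypothetical helper, and it treats zero parts as a problem regardless of emptyLastPartAllowed, following the statement that there must be at least one part.

```python
from typing import Any, List, Mapping


def close_problems(parts: Mapping[str, Any], params: Mapping[str, Any]) -> List[str]:
    """Return reasons why closing would be rejected; an empty list means none found."""
    if not parts:
        return ["there are zero parts"]
    # Part indices are string keys, so order them numerically.
    ordered = sorted(parts.items(), key=lambda item: int(item[0]))
    pending = [index for index, info in ordered if info["state"] != "complete"]
    if pending:
        # Sizes are null for pending parts, so stop here.
        return ["parts still pending: " + ", ".join(map(str, pending))]
    problems = []
    sizes = [info["size"] for _, info in ordered]
    for (index, _), size in zip(ordered[:-1], sizes[:-1]):
        if size < params["minimumPartSize"]:
            problems.append(f"part {index} is smaller than minimumPartSize")
    if sizes[-1] == 0 and not params["emptyLastPartAllowed"]:
        problems.append("the part with the highest index is empty")
    if sum(sizes) > params["maximumFileSize"]:
        problems.append("total size exceeds maximumFileSize")
    return problems
```

The server remains the authority on whether close succeeds; a check like this only avoids requests that are certain to fail.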

API method: /file-xxxx/download

Specification

Generates a "download URL" for downloading the contents of this file object. The download URL may refer to a different endpoint than the DNAnexus API server, and accepts HTTP GET requests.

Requests to the download URL must be initiated within the number of seconds specified in the "duration" input parameter (starting from the time this call is made, according to the server), after which the URL will expire. GET requests MUST include any headers specified in the API server's response to /file-xxxx/download (see below). The download URL also honors the "Range" HTTP request headers, allowing clients to download only a particular byte range of the file.

Paths should include the project context, such as project-xxxx:file-yyyy, or project-xxxx:/path/to/file.txt.

The download URL has the following support for CORS:

  • If a GET request to download URL includes the "Origin" header, its contents will be propagated into the "Access-Control-Allow-Origin" header of the response.

  • Preflight requests (OPTIONS requests to a download URL, with appropriate extra headers as defined in the CORS draft) will be accepted if the value of the "Access-Control-Request-Method" header is "GET". The values of "Origin" and "Access-Control-Request-Headers" (if any) of the request will be propagated to "Access-Control-Allow-Origin" and "Access-Control-Allow-Headers" respectively in the response. The "Access-Control-Max-Age" of the response is set to 1 hour.

Successful calls to the download URL will return the HTTP response code 200, and will include a "Content-Type" header, set to whatever Internet Media Type was specified when the file object was created, and a "Content-Disposition: attachment" header that may also include a filename, if requested (see below). The request may include the query string "?inline" to override the Content-Disposition header. Unsuccessful requests will return an HTTP error response code (and in that case there are no guarantees about the response body, as the download URL does not necessarily conform to the general API rules regarding error messages).

Inputs

  • duration int (optional, default is equivalent to 24 hours) Number of seconds (starting from the time this call is made, according to the server) during which the generated URL will be valid

  • filename string (optional) The desired filename of the downloaded file, to be affixed to the returned URL. If provided, this filename will be encoded as a URI component and affixed to the download URL, whose resource portion will end in e.g. '/filename', to ease downloads through web browsers and utilities such as wget.

  • project string (optional) ID of a project containing the file, with which the download URL will be associated. Requests to the download URL will succeed only so long as 1. the file still resides in this project and 2. the user who generated the URL still has at least VIEW permission to this project. If this value is not provided, the URL will work so long as the file resides in any project to which the user who generated the URL has at least VIEW permission. This field must be provided to get the download URL for a watermarked file when invoked outside the context of a DNAnexus job.

  • preauthenticated boolean (optional, default false) Whether to generate a "preauthenticated" download URL, which embeds any necessary authentication information in the URL itself, rather than requiring separate request headers

    • Security note: URLs generated in this way intrinsically provide access to the file data to anyone in possession. Therefore, they should not be unnecessarily stored, logged, printed to console, etc. in production applications.

    • For security reasons, preauthenticated URLs should be project specific.

  • stickyIP boolean (optional if preauthenticated is true; required to be false otherwise; default false) Whether HTTP GET requests to the preauthenticated download URL should be restricted to a single origin IP address. If stickyIP and preauthenticated are true, then the first HTTP GET request to the preauthenticated download URL will dictate the IP address from which all subsequent requests must originate.
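
The constraints among these inputs can be enforced when the request body is assembled. The sketch below builds the input mapping for /file-xxxx/download and rejects combinations the text rules out; download_call_input and its Python parameter names are hypothetical, while the keys of the returned mapping are the input names listed above.

```python
from typing import Any, Dict, Optional


def download_call_input(
    duration: Optional[int] = None,
    filename: Optional[str] = None,
    project: Optional[str] = None,
    preauthenticated: bool = False,
    sticky_ip: bool = False,
) -> Dict[str, Any]:
    """Build the input mapping for /file-xxxx/download."""
    if duration is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(duration, bool) or not isinstance(duration, int) or duration <= 0:
            raise ValueError("duration must be a positive integer")
    if sticky_ip and not preauthenticated:
        raise ValueError("stickyIP must be false unless preauthenticated is true")
    payload: Dict[str, Any] = {"preauthenticated": preauthenticated, "stickyIP": sticky_ip}
    # Optional inputs are omitted so that the server defaults apply.
    if duration is not None:
        payload["duration"] = duration
    if filename is not None:
        payload["filename"] = filename
    if project is not None:
        payload["project"] = project
    return payload
```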

Outputs

  • url string An absolute URL to which HTTP GET requests can be made to download the file

  • headers mapping HTTP headers which MUST be supplied with any GET request to url

    • key Header field name

    • value string Header value

    • Security note: the headers may include authentication tokens, and therefore should not be stored, logged, printed to console, etc. in production applications.

    • Note: if a preauthenticated URL was requested, then no keys will be present.

Errors

  • ResourceNotFound

    • project is specified but the file object is not in the specified project

  • PermissionDenied

    • VIEW access required to some project that contains the file object

    • If project is specified, VIEW access is required to that project

  • InvalidInput

    • duration (if provided) is not a positive integer

  • InvalidState

    • The file object is not in the "closed" state

File downloads in web applications

To generate non-preauthenticated file download URLs, web applications (running inside web browsers) should make /file-xxxx/download requests to the separate endpoint https://dl.dnanex.us instead of https://api.dnanexus.com. Browser requests to non-preauthenticated file download URLs are authenticated by means of a URL-specific cookie, set by the API server's response to the /file-xxxx/download route on this separate endpoint.

Non-browser-based applications implementing the above specification, or web applications only needing preauthenticated download URLs, may call /file-xxxx/download on https://api.dnanexus.com as usual.
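
The endpoint rule above reduces to a single decision, sketched below. The two URLs are the ones given in this section; download_route_endpoint and its parameters are hypothetical names.

```python
API_ENDPOINT = "https://api.dnanexus.com"
BROWSER_DOWNLOAD_ENDPOINT = "https://dl.dnanex.us"


def download_route_endpoint(in_browser: bool, preauthenticated: bool) -> str:
    """Pick the endpoint on which to call /file-xxxx/download."""
    # Only browser code that needs non-preauthenticated URLs uses the separate
    # endpoint, because those URLs are authenticated by a cookie set there.
    if in_browser and not preauthenticated:
        return BROWSER_DOWNLOAD_ENDPOINT
    return API_ENDPOINT
```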

Copyright 2024 DNAnexus