Files
Copyright 2024 DNAnexus
A file object can be used to store an opaque array of bytes (i.e. what is traditionally known as a "file"). File objects contain binary data, and are immutable. After a file has been uploaded, its contents cannot be modified.
File objects are stateful: a file is always in one of the states "open", "closing", or "closed". When a new file object is made (by calling new), it is initially empty and its state is "open". In that state, file contents can be uploaded in multiple parts (by calling upload), until a request is made to finalize the file (by calling close). Finalization is not instantaneous, so the file object advances to the "closing" state and remains there until the system has finished finalizing it. While a file is "closing", contents can be neither uploaded nor downloaded. Once finalization is done, the file object advances to the "closed" state, in which file contents can be downloaded (by calling download).
Files that remain in the "open" or "closing" state for too long without any activity are considered abandoned. The user receives a notification about such stale files after 24 hours, and the files are deleted after a few days.
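The lifecycle above can be summarized as a small lookup table. The following is a minimal sketch, not part of the API: the names TRANSFER_ACTIONS and can_transfer are hypothetical, and only the upload and download permissions stated in the text are modeled.

```python
# Hypothetical model of which data-transfer calls each state permits,
# following the lifecycle described above. Names are illustrative only.
TRANSFER_ACTIONS = {
    "open": {"upload"},      # parts may be uploaded
    "closing": set(),        # neither upload nor download while finalizing
    "closed": {"download"},  # contents may be downloaded
}


def can_transfer(state: str, action: str) -> bool:
    """Return True if `action` ("upload" or "download") is allowed in `state`."""
    return action in TRANSFER_ACTIONS.get(state, set())
```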
Genomics datasets are very large, and transferring a very large file in a single HTTP call is fragile. For this reason, uploading files in multiple, smaller parts is the standard way of introducing files into the DNAnexus platform: it allows for a robust, resumable and parallelizable upload experience. To encourage efficient upload practices, the system limits the size of each part, typically to the range of 5 MiB to 5 GiB (the exact limits are project-specific, as described later in this section).
The upload call takes several arguments that indicate which part is to be uploaded and other information specific to that part. The server returns a preauthenticated upload URL specific to that file object and part index, along with several headers that the client must provide with the subsequent HTTP PUT. The user can then upload the part to that URL by doing an HTTP PUT with the content of the part (such as when using "curl -T -X PUT"), along with the headers returned to the client, without providing any other special authentication headers. Users are allowed to upload the same part multiple times (by performing both an upload and matching PUT for a part more than once), but only the last successful PUT will be considered canonical.
The close call will perform finalization, effectively "concatenating" the parts so that once the file is closed, the part distinction is no longer there, and the original file can be subsequently downloaded using the download call. Parts are concatenated in order of ascending part index. Indices do not need to be consecutive.
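The ordering rule can be illustrated locally. This sketch assumes a hypothetical in-memory mapping from part index to part contents; it only shows the ascending-index concatenation described above and does not reflect how the platform performs finalization.

```python
from typing import Mapping


def concatenate_parts(parts: Mapping[int, bytes]) -> bytes:
    """Join part contents in ascending index order; gaps in the indices are fine."""
    return b"".join(parts[index] for index in sorted(parts))
```

For example, parts uploaded with indices 10, 1 and 3 are joined in the order 1, 3, 10.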
Closing a file object is only possible if all parts have been uploaded, i.e. if for every index supplied in any upload call so far, the user has successfully performed a PUT to the respective URL received. Closing will not conclude until all parts have been successfully uploaded. If the user does not complete a part upload for any file part previously created through an upload call, the close call will succeed but the file will remain stuck in "closing" until enough time passes for it to be considered abandoned, and then deleted. Therefore it is imperative that close only be called after all parts have been successfully uploaded.
When all parts have been successfully uploaded, the closing process will usually take on the order of a few seconds to minutes, depending on the size of the file. In rare cases it can take much longer.
The user can query the status of the file object by using the describe call.
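Because closing can take seconds to minutes, a client typically polls describe until the state is "closed". The sketch below assumes a caller-supplied describe callable (hypothetical) that wraps the /file-xxxx/describe route and returns its output as a mapping; the polling interval and timeout are arbitrary illustrative values.

```python
import time
from typing import Callable, Mapping


def wait_until_closed(describe: Callable[[str], Mapping], file_id: str,
                      poll_seconds: float = 2.0, timeout_seconds: float = 600.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `describe` until the file is "closed"; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if describe(file_id)["state"] == "closed":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)
```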
There are certain limits on part sizes and numbers. These limits are given by the fileUploadParameters field of the /project-xxxx/describe output of the project (or container) that contains the file:
Parts have a maximum size, in bytes
Parts may have a minimum size, in bytes
The completed file has a maximum size, in bytes
There is a maximum number of parts that may be uploaded
There may be a minimum number of parts that may be uploaded
See the documentation of /project-xxxx/describe for further details about how to interpret it; the client should call this route before beginning the upload to obtain the appropriate limits and break the file into appropriately sized chunks.
For reference, the default parameters (for projects whose region begins with "aws:") are the following:
maximumPartSize: 5368709120 (5 GiB)
minimumPartSize: 5242880 (5 MiB)
maximumFileSize: 5497558138880 (5 TiB)
maximumNumParts: 10000
emptyLastPartAllowed: true
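One way a client might break a file into parts under these limits is sketched below. The function plan_parts is hypothetical, the params mapping is assumed to have the shape of fileUploadParameters, and uploading a zero-byte file as a single empty part is an assumption based on emptyLastPartAllowed.

```python
import math


def plan_parts(file_size, params):
    """Return (index, offset, size) tuples covering a file of `file_size` bytes."""
    if file_size > params["maximumFileSize"]:
        raise ValueError("file exceeds maximumFileSize")
    if file_size == 0:
        # Assumption: an empty file is uploaded as one empty (last) part.
        if not params.get("emptyLastPartAllowed", False):
            raise ValueError("empty files are not allowed by these parameters")
        return [(1, 0, 0)]
    # Smallest part size that respects both the minimum size and the part count.
    part_size = max(params.get("minimumPartSize", 1),
                    math.ceil(file_size / params["maximumNumParts"]))
    if part_size > params["maximumPartSize"]:
        raise ValueError("file cannot be split within the given limits")
    parts, offset, index = [], 0, 1
    while offset < file_size:
        size = min(part_size, file_size - offset)  # only the last part may be short
        parts.append((index, offset, size))
        offset += size
        index += 1
    return parts
```

With the default parameters above, a 12 MiB file is split into two 5 MiB parts followed by one 2 MiB part.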
The download call returns a preauthenticated URL which can be used to download the file via a simple HTTP GET. The service behind that URL supports the "Range" header of the HTTP standard, allowing for any byte range to be downloaded, and enabling compatibility with download accelerators that fetch multiple ranges in parallel to increase throughput.
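A parallel downloader would issue one GET per byte range. The helper below is a minimal sketch (the name range_headers is hypothetical) that produces the standard HTTP "Range" header values for fixed-size chunks; byte ranges are inclusive at both ends.

```python
def range_headers(file_size, chunk_size):
    """Return "Range" header values covering `file_size` bytes in chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        f"bytes={start}-{min(start + chunk_size, file_size) - 1}"
        for start in range(0, file_size, chunk_size)
    ]
```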
If a file object is removed from the project before it is closed, then in addition to removing this file object from the system, the operation results in the following actions:
Any previously generated upload or download URLs are invalidated.
Existing connections to previously generated upload or download URLs may close or fail with a 500 code (the exact behavior is undefined).
ACCOUNTING NOTE: A file increases byte usage by its size. Byte usage is counted upon upload completion.
/file/new
Creates a new file object. The file is initially in the "open" state. This call can optionally receive an Internet Media Type to associate with that file object. (DNAnexus uses this solely for the purpose of supplying the "Content-Type:" HTTP header when responding to download requests of the file. The Internet Media Type is used by web browsers to identify the kind of data stored in a file, and aid them in deciding what to do with the file when fetching it in their context). All values are accepted without further validation (and sent back as-is in the "Content-Type:" header when a file is downloaded), so long as they contain only characters in the ASCII range 33-126. If the "media" field is not provided, or is set to "", then the system will attempt to auto-detect the Internet Media Type.
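The character rule for "media" can be checked on the client before calling /file/new. This is a sketch of that single rule only; the function name is hypothetical.

```python
def is_valid_media(media: str) -> bool:
    """True if every character is in the ASCII range 33-126 ("" is accepted)."""
    return all(33 <= ord(ch) <= 126 for ch in media)
```

Note that the space character (ASCII 32) falls outside the accepted range, so a value such as "text/plain; charset=utf-8" would be rejected under this rule.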
project
string ID of the project or container to which the record should belong (e.g. the string "project-xxxx")
name
string (optional, default is the new ID) The name of the object
tags
array of strings (optional) Tags to associate with the object
types
array of strings (optional) Types to associate with the object
hidden
boolean (optional, default false) Whether the object should be hidden
properties
mapping (optional) Properties to associate with the object
key Property name
value string Property value
details
mapping or array (optional, default { }) JSON object or array that is to be associated with the object; see the Object Details section for details on valid input
folder
string (optional, default "/") Full path of the folder that is to contain the new object
parents
boolean (optional, default false) Whether all folders in the path provided in folder should be created if they do not exist
media
string (optional, default "") The Internet Media Type (formerly known as MIME type or Content-type) of the file
nonce
string (optional) Unique identifier for this request. Ensures that even if multiple requests fail and are retried, only a single file is created. For more information, see Nonces.
id
string ID of the created file object (i.e. a string in the form "file-xxxx")
InvalidInput
A reserved linking string ("$dnanexus_link") appears as a key in a hash in details but is not the only key in the hash
A reserved linking string ("$dnanexus_link") appears as the only key in a hash in details but has value other than a string
The key "media" (if provided) contains at least one character outside of the ASCII range 33-126
For each property key-value pair, the size, encoded in UTF-8, of the property key may not exceed 100 bytes and the property value may not exceed 700 bytes
A nonce was reused in a request but some of the other inputs had changed signifying a new and different request
A nonce may not exceed 128 bytes
PermissionDenied
UPLOAD access required
File creation restricted to job context in externalUploadRestricted project
Project's defaultSymlink drive is not accessible to perform this action
Action failed because CreateMultiPartUpload is not available for this drive
InvalidType
project is not a project ID
ResourceNotFound
The specified project is not found
The route in folder does not exist, and parents is false
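The size limits listed under InvalidInput can be checked locally before sending a request. The sketch below covers only the property and nonce byte limits; the function name and the wording of the messages are hypothetical, and the server remains the authority on validation.

```python
def new_file_input_problems(payload):
    """Return messages for size limits that a /file/new input would violate."""
    problems = []
    for key, value in payload.get("properties", {}).items():
        if len(key.encode("utf-8")) > 100:
            problems.append("property key exceeds 100 bytes")
        if len(value.encode("utf-8")) > 700:
            problems.append("property value exceeds 700 bytes")
    nonce = payload.get("nonce")
    if nonce is not None and len(nonce.encode("utf-8")) > 128:
        problems.append("nonce exceeds 128 bytes")
    return problems
```

Sizes are measured on the UTF-8 encoding, so a key of 51 two-byte characters is already over the limit.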
/file-xxxx/upload
Informs the system that a file part (identified by a particular index) needs to be uploaded, and retrieves a "part upload URL" (specific to this part) for performing the upload of that part. This method needs to be called at least once during the file object lifecycle. Once this method is called for a particular index, then data for that part must be provided to the corresponding part upload URL before calling the "close" method.
The part upload URL returned by this method may refer to a different endpoint than the DNAnexus API server, and accepts HTTP PUT requests supplying the binary data for the file part. Any PUT request to the part upload URL must be initiated shortly after its generation, or else a new URL for the part must be generated with another call to upload. The PUT request MUST include all HTTP headers that are specified in the API server's response to upload (see below). A "Content-Type" header should not be supplied, since the Internet Media Type is not set separately for each part.
The part upload URL has support for CORS with the following configuration:
SSL is required (from an origin served over https)
Part uploads must use the HTTP PUT method
Allowed HTTP headers
content-length
origin
content-md5
accept
content-type
x-amz-server-side-encryption
If the request to a part upload URL completes successfully, an HTTP response with a response code in the 2xx range will be returned, with a blank response body. If the upload is unsuccessful, an HTTP response with an error response code will be returned.
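A part upload can be sketched with Python's standard http.client module, which does not add a "Content-Type" header on its own. The function put_part is hypothetical; url and headers are assumed to come from the response to upload, and the optional connection argument exists only so the logic can be exercised without a network.

```python
import http.client
from urllib.parse import urlsplit


def put_part(url, headers, data, connection=None):
    """PUT `data` to a part upload URL; return True on a 2xx response."""
    target_url = urlsplit(url)
    conn = connection or http.client.HTTPSConnection(target_url.netloc)
    target = target_url.path or "/"
    if target_url.query:
        target += "?" + target_url.query
    # Send exactly the headers returned by the upload call, nothing else.
    conn.request("PUT", target, body=data, headers=dict(headers))
    response = conn.getresponse()
    response.read()  # the body is blank on success
    return 200 <= response.status < 300
```

Because the headers may carry authentication tokens, they should not be logged.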
This method may be called multiple times with the same index parameter. The system maintains a state for each part, which can be either "pending" or "complete". The first time this method is called, the state of the respective part is set to "pending". If the request completes successfully, and in the meantime no other request has been made to that part upload URL, then the state is eventually set to "complete". However, users are allowed to make multiple upload requests to the same part index multiple times (to reupload a piece), in which case subsequent upload requests will reset the state back to "pending". If multiple overlapping requests are made to that part URL, the last successful request is considered the canonical one, hence the part will be pending or complete based on the fate of that last request.
All parts, except the part with the highest index, have a minimum size given by the fileUploadParameters.minimumPartSize field of the /project-xxxx/describe output. If the fileUploadParameters.emptyLastPartAllowed field of the /project-xxxx/describe output has the value false, then the last part must contain at least 1 byte.
All parts have a maximum size given by the fileUploadParameters.maximumPartSize field of the /project-xxxx/describe output.
size
int The size in bytes of this file part
md5
string Hex encoding of the file part's MD5 message-digest
index
int (optional, default 1) Number that determines the relative ordering of parts during the concatenation process that occurs in close. This must be at least 1, and at most the value fileUploadParameters.maximumNumParts returned by /project-xxxx/describe.
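The three inputs can be derived directly from the bytes of a part. This is a minimal sketch using the standard hashlib module; the function name upload_input is hypothetical.

```python
import hashlib


def upload_input(part_bytes, index=1):
    """Build the input mapping for /file-xxxx/upload from a part's contents."""
    return {
        "size": len(part_bytes),
        "md5": hashlib.md5(part_bytes).hexdigest(),  # 32 hex characters
        "index": index,
    }
```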
url
string A URL (of the https scheme) to which data may be sent via HTTP PUT
expires
timestamp Time at which url will expire; this will be a couple of minutes in the future
headers
mapping HTTP headers which must be supplied with any PUT request to url
key Header field name
value string Header value
Security note: the headers may include authentication tokens, and therefore should not be stored, logged, printed to console, etc. in production applications
PermissionDenied
UPLOAD access required
File upload restricted to job context in externalUploadRestricted project
InvalidInput
size must be a non-negative integer, no greater than fileUploadParameters.maximumPartSize
If fileUploadParameters.emptyLastPartAllowed is false, size must be at least min(fileUploadParameters.minimumPartSize, 1)
md5 must be a hex string of the appropriate length
index (if provided) must be a positive integer, no greater than fileUploadParameters.maximumNumParts
InvalidState
The file object is not in the open state
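The InvalidInput conditions above can be mirrored in a client-side check. The sketch below follows the listed rules literally, assumes that "the appropriate length" for an MD5 hex string is 32 characters, and uses a hypothetical function name; params is assumed to have the shape of fileUploadParameters.

```python
import string


def upload_input_problems(size, md5, index, params):
    """Return the names of upload inputs that break the listed rules."""
    problems = []
    size_ok = isinstance(size, int) and 0 <= size <= params["maximumPartSize"]
    if size_ok and not params.get("emptyLastPartAllowed", True):
        # Literal reading of the rule stated for emptyLastPartAllowed = false.
        size_ok = size >= min(params.get("minimumPartSize", 1), 1)
    if not size_ok:
        problems.append("size")
    md5_ok = (isinstance(md5, str) and len(md5) == 32
              and all(ch in string.hexdigits for ch in md5))
    if not md5_ok:
        problems.append("md5")
    if not (isinstance(index, int) and 1 <= index <= params["maximumNumParts"]):
        problems.append("index")
    return problems
```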
/file-xxxx/describe
Describes a file object (see also /record-xxxx/describe). Returns, among others, the Internet Media Type of the file as well as the state of the file object. If the file object is in the "closed" state, the file size is reported as well. If the "parts" key in input map is "true", or the file object is in the "open" state, the response contains a "parts" key, whose value is a map describing the status of the parts that the system knows about. More specifically, for every part that the system has been informed via an "upload" call, the "parts" map contains a key corresponding to the part index (represented as a string), whose value is a map with the part status. This includes the state, size, and md5 of the part. The state can be either "pending" or "complete".
Alternatively, you can use the /system/describeDataObjects method to describe a large number of data objects at once.
As mentioned in the description of the "upload" call, a part can be in the "pending" state for any of the following reasons:
A PUT to its part upload URL has not been successfully completed.
An earlier PUT to its part upload URL has been successfully completed, but the request initiated last is either still ongoing or failed.
A part will be in the "complete" state when a PUT to its part upload URL has been successfully completed. In that case, the amount of data received in that request is shown in the "size" field, and MD5 hash of the data received is shown in the "md5" field (which are otherwise set to null, when the part is in "pending" state).
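Given the "parts" map from a describe response, a client can work out which parts still need a successful PUT. This sketch assumes the output shape described above (string keys, a "state" per part); the function name is hypothetical.

```python
def pending_part_indices(describe_output):
    """Return the sorted integer indices of parts that are not "complete"."""
    parts = describe_output.get("parts", {})
    return sorted(
        int(index) for index, info in parts.items()
        if info["state"] != "complete"
    )
```

Keys are converted to integers before sorting, so part "10" is ordered after part "2".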
A project ID can be given to request user-provided metadata from a particular project and will be treated as a hint, i.e. if the specified project does not contain the object and another project is found which does contain it and for which the user has VIEW permissions, this other project is used instead to return the metadata. The project ID of the project used to return the user-provided metadata is always returned, regardless of whether it was the same as the hint provided. Details can also be requested via this method, but if the requestor does not have VIEW access, they will not be returned.
Files that are dispensed by a third-party data provider may be watermarked. The content of a watermarked file is determined by the id of the file, and by the watermarkId and watermarkVersion associated with the file in a specific project. The third-party data provider may update the watermark to a new version from time to time, thus changing the content of the watermarked file associated with the changed watermark.
project
string (optional) Project or container ID to be used as a hint for finding the object in an accessible project. This field should be provided to get consistent output for watermarked files.
defaultFields
boolean (optional, default false if fields is supplied, true otherwise) Whether to include the default set of fields in the output (the default fields are described in the "Outputs" section below). The selections are overridden by any fields explicitly named in fields.
fields
mapping (optional) Include or exclude the specified fields from the output. These selections override the settings in defaultFields.
key Desired output field; see the "Outputs" section below for valid values here
value boolean whether to include the field
The following options are deprecated (and will not be respected if fields is present):
parts
boolean (optional, default true if file is in the "open" state and false otherwise) Whether additional information for each part should be returned
properties
boolean (optional, default false) Whether the properties should be returned
details
boolean (optional, default false) Whether the details should also be returned
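As an illustration of the non-deprecated inputs, the sketch below builds a describe input that requests only named fields. The helper name is hypothetical; because fields is supplied and defaultFields is omitted, the default set of fields would not be included.

```python
def describe_input(fields, project=None):
    """Build a /file-xxxx/describe input that requests only `fields`."""
    payload = {"fields": {name: True for name in fields}}
    if project is not None:
        payload["project"] = project  # treated by the server as a hint
    return payload
```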
id
string The object ID (i.e. the string "file-xxxx")
The following fields are included by default (but can be disabled using fields or defaultFields):
project
string ID of the project or container in which the object was found
class
string The value "file"
types
array of strings Types associated with the object
created
timestamp Time at which this object was created
state
string The value "open", "closing", or "closed"
hidden
boolean Whether the object is hidden or not
links
array of strings The object IDs that are pointed to from this object
name
string The name of the object
folder
string The full path to the folder containing the object
sponsored
boolean Whether the object is sponsored by DNAnexus
tags
array of strings Tags associated with the object
modified
timestamp Time at which the user-provided metadata of the object was last modified
media
string The Internet Media Type of the file
archivalState
string The archival state of the file
createdBy
mapping How the object was created
user
string ID of the user who created the object or launched an execution which created the object
job
string (present if a job created the object) ID of the job that created the object
executable
string (present if a job created the object) ID of the app or applet that the job was running
drive
string The drive ID that the file is located in
symlinkPath
mapping Remote path of the symlink
container
string The container name (i.e. region:bucket for AWS S3 and containerName for Azure Blob)
object
string The remote path of the symlink
md5
string Hex encoding of the whole file part's MD5 message-digest (Note this is only for readable symlink files)
The following field is included by default if the file is open:
parts
mapping Information on the file parts that have been or are being uploaded
key Part index that has been provided to any /file-xxxx/upload calls on the file so far
value mapping Information on the file part with key/values:
state
string Either "pending" or "complete"
size
int or null The size of the part (in bytes) if state is "complete"; null otherwise
md5
string or null The hexadecimal encoded value of MD5 message-digest (as defined in RFC 1321) of the data if state is "complete"; null otherwise
The following field (included by default) is only available if the object is in the "closed" state:
size
int Size of the file in bytes
The following field (included by default) is available if the object is sponsored by a third party:
sponsoredUntil
timestamp Indicates the expiration time of data sponsorship (this field is only set if the object is currently sponsored, and if set, the specified time is always in the future)
The following fields are only returned if the corresponding field in the fields input is set to true:
properties
mapping Properties associated with the object
key Property name
value string Property value
details
mapping or array Contents of the object’s details
watermarkId
string ID of the watermark applied to the file's content during download
watermarkVersion
string version of the watermark's content applied to the file's content during download
resolvedPolicies
mapping A mapping of policies affecting file-xxxx within the scope of a single project. Note that project must be specified in the input to receive consistent results. Also note that policies may be updated by your data provider at any time. Fields in this mapping will be:
isExternalDownloadable
boolean True if file-xxxx is able to be downloaded, false otherwise.
ResourceNotFound
project, if specified, does not exist
PermissionDenied
VIEW access required to some project that contains the file object
If project is specified, VIEW access is required to that project
/file-xxxx/close
Initiates finalization of the file object, if it is not already in the "closed" state.
To close a file object, there must be at least one part, and all of the parts must be in the "complete" state. If this call is successful, it will return immediately and the file object will advance to the "closing" state. The system will "concatenate" the parts, in order of increasing part index (and those indices do not have to be consecutive). Later, when the system is done, the file object will advance to the "closed" state. For a more detailed discussion please refer to the section "Uploading".
All parts, except the part with the highest index, have a minimum size given by the fileUploadParameters.minimumPartSize field of the /project-xxxx/describe output.
The part with the highest index must contain at least one byte if fileUploadParameters.emptyLastPartAllowed is false.
The total file size cannot exceed the size given by the fileUploadParameters.maximumFileSize field of the /project-xxxx/describe output.
If fileUploadParameters.emptyLastPartAllowed is true, there must be at least one part.
If you call this method on a file in the "closed" state, the call will have no effect. The call will report success and the detail field will be set as shown in "Outputs" below.
None
id
string ID of the manipulated object (i.e. the string "file-xxxx")
If the object is in the closed state:
detail
string String containing an explanatory message
PermissionDenied
UPLOAD access required
File closing restricted to job context in externalUploadRestricted project
InvalidState
fileUploadParameters.emptyLastPartAllowed is true and there are zero parts
At least one part is in the "pending" state
There exists a part, other than the one with the highest part index, whose size is less than fileUploadParameters.minimumPartSize bytes
fileUploadParameters.emptyLastPartAllowed is false and the part with the highest index has 0 bytes
The file has size larger than fileUploadParameters.maximumFileSize bytes
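Since calling close prematurely can leave a file unusable, a client may want to check these conditions itself first. The sketch below mirrors the InvalidState list using the "parts" map from describe and a mapping shaped like fileUploadParameters; the function name and message strings are hypothetical, and treating zero parts as an error in all cases follows the statement that a file must have at least one part.

```python
def close_problems(parts, params):
    """Return reasons why close would be rejected (an empty list if none)."""
    if not parts:
        return ["there are zero parts"]
    if any(info["state"] != "complete" for info in parts.values()):
        # Sizes are null for pending parts, so the size checks are skipped.
        return ["at least one part is pending"]
    # Sort by numeric part index; the keys of the parts map are strings.
    sizes = [parts[key]["size"] for key in sorted(parts, key=int)]
    problems = []
    if any(size < params.get("minimumPartSize", 0) for size in sizes[:-1]):
        problems.append("a non-final part is smaller than minimumPartSize")
    if not params.get("emptyLastPartAllowed", True) and sizes[-1] == 0:
        problems.append("the last part is empty")
    if sum(sizes) > params["maximumFileSize"]:
        problems.append("the file exceeds maximumFileSize")
    return problems
```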
/file-xxxx/download
Generates a "download URL" for downloading the contents of this file object. The download URL may refer to a different endpoint than the DNAnexus API server, and accepts HTTP GET requests.
Requests to the download URL must be initiated within the number of seconds specified in the "duration" input parameter (starting from the time this call is made, according to the server), after which the URL will expire. GET requests MUST include any headers specified in the API server's response to /file-xxxx/download (see below). The download URL also honors the "Range" HTTP request header, allowing clients to download only a particular byte range of the file.
Paths should include the project context, such as project-xxxx:file-yyyy or project-xxxx:/path/to/file.txt.
The download URL has the following support for CORS:
If a GET request to download URL includes the "Origin" header, its contents will be propagated into the "Access-Control-Allow-Origin" header of the response.
Preflight requests (OPTIONS requests to a download URL, with appropriate extra headers as defined in the CORS draft) will be accepted if the value of the "Access-Control-Request-Method" header is "GET". The values of "Origin" and "Access-Control-Request-Headers" (if any) of the request will be propagated to "Access-Control-Allow-Origin" and "Access-Control-Allow-Headers" respectively in the response. The "Access-Control-Max-Age" of the response is set to 1 hour.
Successful calls to the download URL will return the HTTP response code 200, and will include a "Content-Type" header, set to whatever Internet Media Type was specified when the file object was created, and a "Content-Disposition: attachment" header that may also include a filename, if requested (see below). The request may include the query string "?inline" to override the Content-Disposition header. Unsuccessful requests will return an HTTP error response code (and in that case there are no guarantees about the response body, as the download URL does not necessarily conform to the general API rules regarding error messages).
duration
int (optional, default is equivalent to 24 hours) Number of seconds (starting from the time this call is made, according to the server) during which the generated URL will be valid
filename
string (optional) The desired filename of the downloaded file, to be affixed to the returned URL. If provided, this filename will be encoded as a URI component and affixed to the download URL, whose resource portion will end in e.g. '/filename', to ease downloads through web browsers and utilities such as wget.
project
string (optional) ID of a project containing the file, with which the download URL will be associated. Requests to the download URL will succeed only so long as 1. the file still resides in this project and 2. the user who generated the URL still has at least VIEW permission to this project. If this value is not provided, the URL will work so long as the file resides in any project to which the user who generated the URL has at least VIEW permission. This field must be provided to get the download URL for a watermarked file when invoked outside the context of a DNAnexus job.
preauthenticated
boolean (optional, default false) Whether to generate a "preauthenticated" download URL, which embeds any necessary authentication information in the URL itself, rather than requiring separate request headers
Security note: URLs generated in this way intrinsically provide access to the file data to anyone in possession. Therefore, they should not be unnecessarily stored, logged, printed to console, etc. in production applications.
For security reasons, preauthenticated URLs should be project specific.
stickyIP
boolean (optional if preauthenticated is true; required to be false otherwise; default false) Whether HTTP GET requests to the preauthenticated download URL should be restricted to a single origin IP address. If stickyIP and preauthenticated are true, then the first HTTP GET request to the preauthenticated download URL will dictate the IP address from which all subsequent requests must originate.
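The constraints among these inputs can be enforced when the request body is built. The following is a sketch with a hypothetical helper name; it checks only the two rules stated in this section (stickyIP requires preauthenticated, and duration must be a positive integer).

```python
def download_input(duration=None, filename=None, project=None,
                   preauthenticated=False, sticky_ip=False):
    """Build a /file-xxxx/download input, rejecting inconsistent options."""
    if sticky_ip and not preauthenticated:
        raise ValueError("stickyIP requires preauthenticated to be true")
    if duration is not None and (not isinstance(duration, int) or duration <= 0):
        raise ValueError("duration must be a positive integer")
    payload = {"preauthenticated": preauthenticated, "stickyIP": sticky_ip}
    optional = {"duration": duration, "filename": filename, "project": project}
    # Omit inputs that were not given so the server-side defaults apply.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```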
url
string An absolute URL to which HTTP GET requests can be made to download the file
headers
mapping HTTP headers which MUST be supplied with any GET request to url
key Header field name
value string Header value
Security note: the headers may include authentication tokens, and therefore should not be stored, logged, printed to console, etc. in production applications.
Note: if a preauthenticated URL was requested, then no keys will be present.
ResourceNotFound
project is specified but the file object is not in the specified project
PermissionDenied
VIEW access required to some project that contains the file object
If project is specified, VIEW access is required to that project
InvalidInput
duration (if provided) is not a positive integer
InvalidState
The file object is not in the "closed" state
To generate non-preauthenticated file download URLs, web applications (running inside web browsers) should make /file-xxxx/download requests to the separate endpoint https://dl.dnanex.us instead of https://api.dnanexus.com. Browser requests to non-preauthenticated file download URLs are authenticated by means of a URL-specific cookie, set by the API server's response to the /file-xxxx/download route on this separate endpoint.
Non-browser-based applications implementing the above specification, or web applications only needing preauthenticated download URLs, may call /file-xxxx/download on https://api.dnanexus.com as usual.
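The endpoint rule above reduces to a single condition. The sketch below uses a hypothetical function name and takes the two endpoints verbatim from the text.

```python
def download_api_endpoint(in_browser: bool, preauthenticated: bool) -> str:
    """Pick the endpoint on which to call /file-xxxx/download."""
    if in_browser and not preauthenticated:
        # Browser apps need the cookie that only this endpoint sets.
        return "https://dl.dnanex.us"
    return "https://api.dnanexus.com"
```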