
Creating a dataset

How to create a dataset from scratch.

Procedure

  1. Go to Datasets > Add dataset.
  2. In the Add a new dataset panel, give a name to your dataset and select the connection in which you want to create your dataset.
    If you want to add a dataset from a connection that does not exist yet, you can create this connection directly from the connection drop-down list.
  3. Add a description if needed, and fill in the required properties of the dataset.
    • For S3 and HDFS file storage connections, an Auto detect button allows you to automatically detect and fill in the format of your data (CSV, Excel, Avro, or Parquet).

    • The database query and table types are not compatible, as a query-type database cannot be used as a Destination dataset. Therefore, if you change the database configuration to another type after saving it, a check is triggered on your pipeline to verify that the operation is possible.

  4. (Optional) Click View sample to preview the first records of your dataset.
  5. Click Validate to save your dataset.

Results

The new dataset is added to the list on the Datasets page and is ready to be used.
Once created, you can go to the dataset detailed view to display a sample of your data in different formats:
  • Grid: displays the first 10 000 records of your data in tabular form
  • Hierarchy: displays the first 10 000 records of your data in a tree-like structure
  • Raw: displays an untouched and unfiltered version of the first 10 000 records of your data

Creating a local dataset

Import any local CSV, Excel, Avro, or Parquet file directly to your inventory. Datasets from various connections can be added via the Add dataset button, but if you want to simply import one of your local files, you can easily do so with the Drag a file or browse button.

Procedure

To directly import a local dataset, you can either:
  • Drag your local file and drop it anywhere on the dataset screen.
  • Click the Drag a file or browse button to open your file explorer and select the file to import.

Your file is uploaded and the local dataset is created. The Overview page opens directly. If a local connection has not been set up yet, it is created on the fly.

This new connection relies on the Cloud Engine for Design when possible, and only uses an existing Remote Engine Gen2 if it is the only engine available.

If you already own a local connection, the local import preferably relies on the oldest one created on the Cloud Engine for Design, and uses one created on a Remote Engine Gen2 only if necessary.

However, if no engine is available at import time, the local import will be disabled.
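The engine-selection rules above can be sketched as a small Python model. This is purely illustrative (the `Engine` type and `pick_engine_for_local_import` function are hypothetical, not part of the product):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Engine:
    name: str
    kind: str            # "cloud_design" or "remote_gen2" (assumed labels)
    created: datetime
    available: bool

def pick_engine_for_local_import(engines: list[Engine]) -> Optional[Engine]:
    """Return the engine a local import would use, or None if import is disabled."""
    candidates = [e for e in engines if e.available]
    # Prefer the oldest available Cloud Engine for Design...
    cloud = sorted((e for e in candidates if e.kind == "cloud_design"),
                   key=lambda e: e.created)
    if cloud:
        return cloud[0]
    # ...otherwise fall back to a Remote Engine Gen2 if one exists.
    remote = [e for e in candidates if e.kind == "remote_gen2"]
    if remote:
        return remote[0]
    # No engine available: the local import is disabled.
    return None
```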

The CSV file properties, such as the escape character and field delimiter, are automatically detected in the background, but you can change them anytime in the dataset properties.
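As an illustration of this kind of detection, Python's standard `csv.Sniffer` can infer the field delimiter from a small sample of the data. This is only a sketch of the general idea, not the mechanism the product actually uses:

```python
import csv
import io

# A small sample of semicolon-delimited data (made up for this example).
sample = "First_Name;Last_Name;Phone1\nJohn;Doe;555-0101\nJane;Roe;555-0102\n"

# csv.Sniffer inspects the sample and infers dialect properties
# such as the field delimiter, restricted here to common candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
print(dialect.delimiter)   # ';'

# The detected dialect can then be used to read the data.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[0])             # ['First_Name', 'Last_Name', 'Phone1']
```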

Results

Your local file is added to the list of datasets, and a Local connection is created if you did not already have one.

Creating a test dataset

How to create a dataset based on the schema that you enter manually.

Test datasets are useful for supplying a fixed set of values without requiring a real-life record store, which makes them a simple way to try out the product.

Procedure

  1. Go to Datasets > Add dataset.
  2. In the Add a new dataset panel, give a name to your Test dataset.
  3. Select the Test connection you have previously created in which you want to add your data.
  4. Select the format of your data:
    • CSV: in that case the expected format for the schema fields is the following:
      • must begin with [A-Za-z_] characters
      • can only contain [A-Za-z0-9_] characters
      • can only be separated by semicolons
      Example: First_Name;Last_Name;Phone1;Phone2;Address;State;Company
      Information noteNote: If you do not specify a format, a generic one will be created by default.
    • JSON: in that case your JSON values must follow a specific, consistent format: a sequence of records written one after another, optionally separated by a line feed. A record does not need to fit on a single line. Note that the resulting data in the text area is not a standard JSON document enclosed in square brackets.

      Example:

        {
          "Id": 3146717,
          "PosTime": 1525097499899,
          "Latitude": 48.8585,
          "Longitude": 2.4921,
          "Operator": "Air France"
        }
        {
          "Id": 3757865,
          "PosTime": 1525097474634,
          "Latitude": 48.5018,
          "Longitude": 2.2246,
          "Operator": "Lufthansa"
        }
    • AVRO: in that case you must also enter the schema of your Avro records, which is a JSON document with a specific syntax described in the Apache Avro documentation.
  5. In the Values area, type in or paste your data.
    The size of your data cannot exceed 32 kilobytes.
  6. (Optional) Click View sample to check that your data is valid.
  7. Click Validate to save your dataset.
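The schema and value rules above (steps 4 and 5) can be checked locally with a short sketch. The helper names below are hypothetical; the logic only mirrors the rules stated in this procedure:

```python
import json
import re

# A CSV schema field must begin with [A-Za-z_] and contain only [A-Za-z0-9_].
FIELD_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def valid_csv_schema(line: str) -> bool:
    """Check a semicolon-separated CSV schema line against the field rules."""
    fields = line.split(";")
    return bool(fields) and all(FIELD_RE.match(f) for f in fields)

def parse_json_records(text: str) -> list[dict]:
    """Parse a sequence of JSON objects written one after another,
    optionally separated by whitespace (no enclosing square brackets)."""
    decoder = json.JSONDecoder()
    records, pos = [], 0
    while pos < len(text):
        # Skip whitespace between records.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records

def within_size_limit(text: str, limit: int = 32 * 1024) -> bool:
    # The Values area caps the data at 32 kilobytes.
    return len(text.encode("utf-8")) <= limit
```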

Results

You are redirected to the dataset Overview panel, where information and metadata about the dataset are displayed.

To visualize and understand the content of the dataset, open the Sample panel. You can then check that your data is valid.

