Integrating data into the system is key to making your machine learning models work effectively. You do this by setting up a data source, which can be automated once it is configured. All operations occur at the dataset level, so it is important to understand how to create and manage datasets and data sources.


Overview of the Process

Here’s an outline of the steps involved in setting up a data source:

1

Create a Dataset

A dataset is the project-level container where your data sources live. One dataset can hold multiple data sources.

This image shows how to create a new dataset in the system. Datasets allow you to organize your data sources for different business cases.

2

Add a Data Source

Within the dataset, create one or more data sources to load the necessary data.

This image displays the various data source connection options such as CSV, JSON, Google BigQuery, and Microsoft SQL Server.

3

Select the Source

Choose the data source you want to integrate from the list of supported sources:

  • CSV
  • JSON
  • Google BigQuery
  • Google Cloud Storage
  • Microsoft SQL Server
  • SFTP
  • Agillic
  • Active Campaign

4

Enter Credentials

Input the credentials required to connect to the data source. This might involve entering project IDs, API keys, or authentication tokens, depending on the source.

This image shows examples of credential input for Google BigQuery and Microsoft SQL Server.

5

Preview Data

Once connected, preview the data to ensure the source is correctly connected and the data format is as expected.

In this image, you can see a preview of the data fields (e.g., ContactID, Gender, Birthdate) before confirming the connection.
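
The same kind of check can be run locally before uploading a file. The following is a minimal sketch, not part of the product: it assumes a CSV source and uses only the Python standard library. The function name preview_csv and the sample rows are hypothetical; the field names are the ones from the preview example above.

```python
import csv
import io

# Field names taken from the preview example above; adjust to your own data.
EXPECTED_FIELDS = ["ContactID", "Gender", "Birthdate"]


def preview_csv(text, expected_fields, max_rows=5):
    """Return (missing_fields, first_rows) for CSV content given as a string."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []  # None when the input is empty
    missing = [name for name in expected_fields if name not in header]
    rows = []
    for row in reader:
        if len(rows) >= max_rows:
            break
        rows.append(row)
    return missing, rows


# Hypothetical sample data, for illustration only.
SAMPLE = "ContactID,Gender,Birthdate\n1001,F,1990-04-12\n1002,M,1985-11-03\n"

missing, rows = preview_csv(SAMPLE, EXPECTED_FIELDS)
print("missing fields:", missing)
for row in rows:
    print(row)
```

An empty list of missing fields suggests the header matches what you expect; any names it reports are worth fixing before you move on to mapping.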

6

Create Data Mapping

After previewing, map the data fields to specific entities in the allyy data structure. Mapping allows the system to understand the relationship between fields in your data and allyy’s internal structure (Contacts, Responses, Offers, etc.).

Read more about data mapping here.

This image shows the mapping process, where fields like ContactID and OfferResponse are mapped.
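
Conceptually, a mapping is a lookup from each source field to an entity and an attribute on that entity. The sketch below illustrates that idea only; it is not the system's actual mapping format. The entity names (Contacts, Responses) come from the text above, while the attribute names, the FIELD_MAPPING table, and the apply_mapping function are assumptions made for illustration.

```python
# Hypothetical mapping: source column -> (entity, attribute).
# Entity names come from the text above; attribute names are illustrative.
FIELD_MAPPING = {
    "ContactID": ("Contacts", "id"),
    "Gender": ("Contacts", "gender"),
    "Birthdate": ("Contacts", "birthdate"),
    "OfferResponse": ("Responses", "response"),
}


def apply_mapping(row, mapping):
    """Group one source row's values by target entity.

    Returns (entities, unmapped), where unmapped lists the columns
    that have no entry in the mapping.
    """
    entities = {}
    unmapped = []
    for column, value in row.items():
        if column not in mapping:
            unmapped.append(column)
            continue
        entity, attribute = mapping[column]
        entities.setdefault(entity, {})[attribute] = value
    return entities, unmapped


# Hypothetical source row, for illustration only.
row = {"ContactID": "1001", "Gender": "F", "OfferResponse": "accepted", "Notes": "n/a"}
entities, unmapped = apply_mapping(row, FIELD_MAPPING)
print(entities)
print("unmapped columns:", unmapped)
```

Listing unmapped columns explicitly makes it easy to spot fields that would otherwise be silently ignored.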

7

Save the Mapping

Once the mapping is saved, the data source can be accessed or synchronized at any time.

8

Data Synchronization or Workflow

For batch data, click the synchronize button to pull in data on demand, or schedule synchronization via a workflow to pull it in periodically.

For streaming data, use the start streaming or stop streaming buttons to manage real-time data ingestion.

9

View Details and History

You can inspect the data source for details such as source information, last sync status, errors, and history of uploads.


Streaming vs. Batch

Batch Data

Synchronization happens manually or on a schedule. To pull in data, click the synchronize button or schedule it via a workflow.

Streaming Data

Data flows in real time. Use the start/stop streaming buttons to manage continuous data flow. Streaming data cannot be scheduled.
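
The difference between the two modes can be summarized as a small rule set. The following sketch is an illustration under stated assumptions, not the system's API: the Mode and SourceConfig names, the schedule field, and the action labels are hypothetical and simply encode the rules described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    BATCH = "batch"
    STREAMING = "streaming"


@dataclass
class SourceConfig:
    """Hypothetical description of a data source and how it is refreshed."""

    name: str
    mode: Mode
    schedule: Optional[str] = None  # illustrative label such as "daily"

    def __post_init__(self):
        # Rule from the text: streaming data cannot be scheduled.
        if self.mode is Mode.STREAMING and self.schedule is not None:
            raise ValueError("streaming sources cannot be scheduled")

    def allowed_actions(self):
        """Return the actions described above for this source's mode."""
        if self.mode is Mode.BATCH:
            return ["synchronize", "schedule via workflow"]
        return ["start streaming", "stop streaming"]


batch = SourceConfig(name="contacts_csv", mode=Mode.BATCH, schedule="daily")
stream = SourceConfig(name="events", mode=Mode.STREAMING)
print(batch.allowed_actions())
print(stream.allowed_actions())
```

Creating a streaming source with a schedule raises a ValueError in this sketch, which mirrors the restriction that streaming data cannot be scheduled.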


Conclusion

Setting up data sources lets you integrate external data into the system. By following this process, you can configure a data source once and reuse it indefinitely, whether you pull in batch data manually, schedule synchronization via a workflow, or use real-time streaming. Mapping the data correctly to the allyy data structure is crucial for using it effectively in models and predictions.