GPT Prompts for Data Teams

Most of the prompts below are real-life prompts suggested by readers, used internally by our team at Castor, found on Reddit, or gathered from conversations at data events. Not all of them will be relevant to you, but the objective is to inspire people to try ChatGPT to drive productivity.

How to use this GPT Prompts guide?

  1. Explore the sections that interest you.
  2. Do not copy/paste any proprietary data into ChatGPT. This can be detrimental to your company. Always generate fake data or make sure what you are sending is non-sensitive.
  3. GPT-4 will give you a much better experience with coding scripts than GPT-3. It is still not perfect, but it will keep improving over time.
  4. Customize GPT to yourself before asking it anything. Write it a quick 10-line introduction about who you are, what you care about, and what your goals are. This will drastically improve the output of the following prompts.
  5. You will often need to tweak the code ChatGPT gives back, but it gets you 90% of the way in the right direction.

Want to help?
➡️ Give feedback in the chat, in the bottom right corner.
➡️ Share it so more data teams can increase their productivity with ChatGPT

GPT Prompts for Data Engineering

Data Pipeline Development
Design, build, and maintain data pipelines to ingest, clean, transform, and store data from various sources into data storage systems, such as data warehouses or data lakes, ensuring data is available for analytics and machine learning tasks.
Generate Data

Prompt:
"I want you to act as a fake data generator. I need a dataset that has [x] rows and [y] columns: [insert column names]”

Example:

Generate Data From DDL

Prompt:

"Please help me generate sample data for the following SQL DDL table definition:

SQL DDL:[Provide your SQL DDL table definition, including table name, column names, and data types]

Based on the table definition, please generate a set of somewhat realistic sample data that can be used for testing and mock data generation. Ensure that the sample data is consistent with the meaning of the column names and adheres to the specified data types."

Join Data Set

Prompt:
I want you to act as a data engineer and code in Python for me. I have two datasets, A and B. A is [explain A structure]. B is [explain B structure]. I need to join them on a foreign key [enter FK].

Example:
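
For reference, here is a minimal sketch of the kind of code ChatGPT typically returns for this prompt, assuming both datasets are CSV files and a hypothetical `customer_id` foreign key:

```python
import pandas as pd

# Hypothetical inputs: dataset A and dataset B share a customer_id foreign key
a = pd.read_csv("dataset_a.csv")
b = pd.read_csv("dataset_b.csv")

# Inner join on the foreign key; use how="left" to keep all rows of A instead
joined = a.merge(b, on="customer_id", how="inner")
joined.to_csv("joined.csv", index=False)
```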

Create CSV→BigQuery Pipeline

Prompt:
"Act as a senior data engineer & provide a Python code sample demonstrating data engineering best practices to move data from a CSV file to BigQuery. Use the standard library when possible, but feel free to use external libraries if they significantly improve the process.”

Example:
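
As a point of comparison, a minimal sketch of what such a pipeline might look like, assuming the google-cloud-bigquery client library and a hypothetical target table; treat it as a starting point rather than a production pipeline:

```python
from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS is configured and the dataset already exists
client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical target table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the CSV
)

with open("data.csv", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)

load_job.result()  # wait for the load job to complete
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```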

Invert a Dependency Tree

I had a list of immediate dependencies for jobs and wanted to reverse it to find the upstream sources.

Prompt:
"In python I have a dependency tree in a dict. Write a script to invert that dependency tree."

Write Airflow Dag

Prompt:
"I'm working on a data pipeline using Apache Airflow, and I need to create a DAG that performs the following tasks in sequence:

  1. Extract data from an API and save it to a CSV file.
  2. Load the CSV data into a PostgreSQL database.
  3. Run an SQL query on the database to aggregate the data and generate a report.
  4. Email the report to a list of recipients.

Can you help me write an Airflow DAG that accomplishes these tasks? Please include comments explaining each part of the DAG, and assume that I have the necessary Python functions to perform the data extraction, loading, querying, and emailing tasks."

Data Integration
Integrate data from disparate sources, such as APIs, databases, or files, to create a unified view of the data. This often involves understanding different data formats, schemas, and dealing with inconsistencies or missing data.
Regex Writing

Prompt:
"Help me solve this regex problem: I need to create a regular expression pattern that matches [specific requirement]. Can you provide a regex pattern and explain how it works?”

SQL Troubleshooting

Prompt:

"Please help me identify any issues or potential problems in the following SQL code:

[Insert SQL Code]

Analyze the provided SQL code and point out any syntax errors, logical issues, performance concerns, or best practice violations that may be present. Additionally, suggest possible improvements or fixes for the identified problems."

Generate Mermaid Diagram

Prompt:
I want you to act as a data engineer. Here are the components of my data pipeline and how they relate to each other: [describe tables or pipeline steps and their relationships]. Write Mermaid diagram code that represents these relationships.

Example:

Translate Code Between DBMS

Prompt:
"Prompt: I want you to act as a coder and write SQL code for [DBMS 1]. What is the equivalent of [DBMS 2]'s DATE_TRUNC for MySQL?

Data Storage & Management
Design and manage data storage solutions, such as relational databases, NoSQL databases, or distributed file systems, to ensure data is organized, accessible, and scalable. This includes tasks like schema design, indexing, and partitioning.
Create a persistent Hive Table

Prompt:
"How to make Hive table persist in pyspark”

Example:
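
The answer generally boils down to writing the DataFrame as a managed table; a minimal sketch, assuming Hive support is enabled on the Spark session and an illustrative database/table name:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("persist-hive-table")
    .enableHiveSupport()   # registers tables in the Hive metastore
    .getOrCreate()
)

df = spark.read.csv("events.csv", header=True, inferSchema=True)  # hypothetical source file

# Persist as a managed Hive table that survives the Spark session
df.write.mode("overwrite").saveAsTable("analytics.events")
```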

SQL Troubleshooting

Prompt:

"Please help me identify any issues or potential problems in the following SQL code:

[Insert SQL Code]

Analyze the provided SQL code and point out any syntax errors, logical issues, performance concerns, or best practice violations that may be present. Additionally, suggest possible improvements or fixes for the identified problems."

Create Stacks for AWS Cloud Formation

Prompt:
"Please help me create an AWS CloudFormation stack using the AWS Cloud Development Kit (CDK) in Python:

AWS Services to Include: [List the AWS services you want to include in the stack, e.g., EC2, S3, Lambda, RDS, etc.]

Stack Requirements: [Provide specific requirements for the stack, such as the desired instance types, number of instances, storage capacities, or any other configuration details]

Please provide step-by-step instructions and Python code for creating the CloudFormation stack using the AWS CDK, along with any necessary prerequisites, imports, and dependencies. Additionally, include any tips or best practices for working with the AWS CDK and CloudFormation in Python."

Data Quality & Monitoring
Implement data quality checks, validation rules, and monitoring systems to ensure the accuracy, consistency, and reliability of the data. Identify and resolve data quality issues, and set up alerts to notify the team of any issues.
Suggest Edge Cases

Prompt:
"I want you to act as a software developer. Please help me catch edge cases for this function [insert function]”

Suggest Data Quality Tests

Prompt:

"I want you to act as a software developer. Please help me catch data quality tests for this data pipeline [insert code]"

Performance Optimisation
Optimize data processing tasks, queries, and storage systems to improve performance and reduce latency. This may involve tuning database configurations, optimizing SQL queries, or leveraging big data processing frameworks such as Apache Spark.
Chain Optimization for a SQL query

Prompt 1:
"I have a SQL query that I'd like to optimize. Here's the query:SELECT * FROM ordersJOIN customers ON orders.customer_id = customers.customer_idWHERE customers.country = 'USA';

Can you help me identify any potential performance issues with this query?”

Prompt 2:

"Thank you for the feedback. I've checked and there are no indexes on the 'customer_id' column in both the 'orders' and 'customers' tables, and there's no index on the 'country' column in the 'customers' table. Should I create any indexes to improve the query's performance? If so, which columns should I index?"

Prompt 3:

"I see, that makes sense. I also noticed that I'm using 'SELECT *' in the query, which selects all columns from both tables. However, I only need a few specific columns from each table. How should I modify the query to select only the columns I need, and will this improve performance?"

Prompt 4:

"Thanks for the advice. I'm also concerned about the number of rows returned by the query. It could potentially return a large number of rows. Is there a way to paginate the results so that I only retrieve a limited number of rows at a time? How can I implement pagination in the query?"

Spark Optimization Ideas

Prompt:

"I'm working with Apache Spark to process large datasets, and I'm looking for ways to optimize the performance of my Spark jobs. Specifically, I'm interested in improving the execution time, reducing the memory footprint, and minimizing data shuffling. Can you provide me with some practical ideas and best practices for optimizing Spark jobs? Additionally, if you have any tips for tuning Spark configurations, I'd appreciate hearing them."

Optimize Pandas

Prompt:

"I want you to act as a code optimizer. Can you point out what's wrong with the following Pandas code and optimize it? [Insert code here]"

Optimize Code Perf

Prompt:

"I want you to act as a software developer. Please help me improve the time complexity of the code below. [Insert code]"

Optimize Python

Prompt:

"I want you to act as a code optimizer. The code is poorly written. How do I correct it? [Insert code here]"

Simplify Python

Prompt:

"I want you to act as a python code simplifier. Can you simplify the following code?"

Communicate Optimization ROI

Prompt:

"Assuming you are a data engineer who has optimized data pipeline processes, Code performance or SQL query output.The output is the following: [provide ROI & metrics about the optimisation] Provide a non-technical explanation highlighting the importance and benefits of these optimizations for business stakeholders, and how it can contribute to the overall success of the company.

Structure the output in 3 bullet points and less than 150 words.  Keep it data driven."

Example:

GPT Prompts for Data Governance

Data Governance Strategy
Develop and implement a comprehensive data governance strategy and framework that aligns with the organization's overall business objectives. This includes defining goals, policies, procedures, and metrics to measure success.
Getting Started with Governance

Prompt:
"Assuming a team has no existing data governance framework in place, provide a step-by-step guide on how to implement data governance from scratch, prioritizing the most important aspects first.”

Examples:

Define Data Governance Goals

Prompt:

"Act a data governance leader. You work in a company doing [industry] . Data is strategic in your company for [X, Y, Z reasons]. Your data governance practice has [number of years]. You have already succesfully implemented [Project 1, 2, 3]. You need to define data governance goals for the next quarter. You want to impact [Strategic Project A & B]"

Write Access Right Policy Framework on Snowflake

Prompt:
"Act as a security engineer from Snowflake. You want to write the Access Control Privileges for your company. Here’s the role & access levels I want to create: [Role 1: System Access 1, Schema Access 1, Object Access 1Role 2: System Access 2, Schema Access 2, Object Access 2Role 3: System Access 3, Schema Access 3, Object Access 3]"

Examples:

List Data Governance Books

Prompt:
"List data governance books to read"

Examples:

Summarize a book on data governance

Prompt:
"Can you give me an in-depth summary of the following book on data governance? I am already familiar to the data governance world.

[Insert Book Title & writer]"

Examples:

Ask data governance strategy based on a specific book

Prompt:
"design a data governance strategy for [Add your industry] to [add context & use case] based on the principles in this book: [add book details]"

Data Quality Measurement
Establish data quality standards, guidelines, and best practices to ensure data accuracy, consistency, and reliability. Oversee the implementation of data quality checks, validation rules, and monitoring systems to identify and resolve data quality issues.
Improve Codebase Readability

Prompt:
"I want you to act as a code analyzer. Can you improve the following code for readability and maintainability? [Insert code]”

Write Data Quality Tests

Prompt:

"Here’s a table: [insert table sample] Can you write data quality tests in SQL/python to make sure the output is consistent. Flag nulls & duplicates."

Data Quality Standards

Prompt:

"Please describe the key data quality standards you would like to establish within your company. Consider including aspects such as accuracy, completeness, consistency, timeliness, and uniqueness. For each standard, provide a brief explanation and suggest appropriate metrics or methods to measure and ensure compliance. Additionally, mention any specific industry regulations or requirements that need to be adhered to.

1. Standard Name (e.g., Accuracy)

  • Explanation: Briefly explain the importance of this standard.
  • Measurement/Compliance Method: How will you measure and ensure compliance with this standard?
  • Industry Requirements (if applicable): Any specific industry regulations to be considered.

2. Standard Name (e.g., Completeness):

  • Explanation: Briefly explain the importance of this standard.
  • Measurement/Compliance Method: How will you measure and ensure compliance with this standard?
  • Industry Requirements (if applicable): Any specific industry regulations to be considered.

[Add more standards as necessary]"

Create a Data Quality Training Session

Prompt:
"As a data governance expert, I am tasked with creating a training session for my company's employees on data quality best practices. The goal of this training is to educate employees on the importance of data quality, common data quality issues, and best practices for ensuring high-quality data. Please provide an outline for the training session, including key topics and explanations for each section. Make sure to cover the following areas:

  • Introduction to data quality
  • [add common data quality issues and their impact at your company]
  • [best practices for data quality management at your company]
  • Practical tips for maintaining data quality
  • Conclusion and next steps"
Data Privacy & Security
Ensure that data privacy and security policies are in place and enforced to protect sensitive information and comply with applicable regulations, such as GDPR or HIPAA. This includes overseeing access controls, encryption, and data masking techniques.
Compliance Checklist

Prompt:
"Please provide a summary of the [X compliance standard] and create a prioritized checklist to help organizations ensure their adherence to the requirements of this standard.Provide the answer in a table”

Example:

List Personal Information from Table Metadata

Prompt:

[insert output of prompt above]

Can you list all the columns that contain personal information?

Select the best encryption method for a specific dataset

Prompt:

"As an AI expert in data security, I am seeking advice on the best methods to encrypt data. My goal is to ensure the confidentiality and integrity of sensitive information. Please provide a list of recommended encryption methods, along with brief descriptions of each method and their use cases. Additionally, if there are any Python libraries that can be used to implement these encryption methods, please mention them as well."

Create a data governance assistant

Prompt:
Here are our data governance policies:

[insert policies] Can you answer all the following questions based on what is written in these policies?

Data Stewardship
Lead a team of data stewards responsible for managing, maintaining, and documenting the organization's data assets. Ensure that data stewards are trained and have a clear understanding of their roles and responsibilities.
Classify & Tag Data Tables

Prompt:
"Generate business tags for a table named: [table name]. With the following columns: [columns name] . The query used to create the table: [insert query]. And for non-sensitive tables, you can add a data sample: [data sample].”

Example:

Organize Data Tables by Theme

Prompt:

Organize & regroup this list of data tables by theme and business tags: [List Tables]

Write a memo after a data quality issue

Copy/Paste Jira Ticket

Prompt:

Can you write a memo to summarize the issue in this ticket? Please structure the answer in the following format.

[Your Name]
[Your Title/Position]
[Your Department]
[Date]

TO: [Recipient Name(s)]
CC: [Optional - Other Relevant Parties to be Copied]
FROM: [Your Name]
SUBJECT: Data Quality Issue and Resolution

Dear [Recipient Name(s)],

I am writing to inform you of a recent data quality issue that was identified within our [data system/database] and to outline the steps taken to address and resolve the matter.

Issue Description: [Provide a brief and clear description of the data quality issue. Include details such as the nature of the problem, the data set(s) affected, and the potential impact on business operations or decision-making.]

Issue Discovery: [Explain how the data quality issue was discovered. If applicable, mention any tools or processes used to identify the issue.]

Resolution Steps: [Outline the steps taken to address and resolve the data quality issue. Include any corrective actions, data validation, or data cleansing processes that were implemented. If the issue has not been fully resolved, explain the ongoing efforts to address it.]

Preventive Measures: [Describe any preventive measures or process improvements that have been put in place to avoid similar data quality issues in the future. This may include changes to data validation rules, data governance policies, or staff training.]

Next Steps: [If applicable, outline any next steps or actions that need to be taken by the recipient(s) or other stakeholders. This may include reviewing updated data, providing feedback, or participating in meetings to discuss the issue further.]

I would like to thank [relevant team members or departments] for their prompt and diligent efforts in addressing this issue. Ensuring the accuracy and integrity of our data is a top priority, and we are committed to continuously improving our data management practices.

Please do not hesitate to reach out to me if you have any questions or require further information regarding this matter.

Thank you for your attention to this issue.

Sincerely,

[Your Name]

Stakeholder Communication
Collaborate with various stakeholders, such as data engineers, data scientists, analysts, and business leaders, to understand their data needs and ensure that data governance initiatives support their requirements. Communicate data governance policies, updates, and best practices throughout the organization to drive awareness and adoption.
Write a Jira Ticket

Prompt:

"Please help me create a Jira ticket with the following details:

Title: [Short, descriptive summary of the issue or feature request]

Description:

  • Background: [Provide context or background information about the issue or feature request]
  • Issue/Feature: [Explain the problem or desired functionality in detail]
  • Expected behavior: [Describe what the expected outcome should be]
  • Steps to reproduce: [If applicable, list the steps required to reproduce the issue]
  • Acceptance criteria: [Clearly define the criteria that must be met for the ticket to be considered complete]
  • Additional notes: [Include any other relevant information, such as screenshots, logs, or potential solutions]"
Convert Code in a language you understand

Prompt:

Convert this code [insert code] into SQL. You can also guide me through what the code is doing.

Explain Technical Data Concepts

Prompt:

"Please help me explain the technical data concept of [Technical data concept] to a non-technical business user, focusing on the [Add context about industry] industry.

Provide a clear and concise explanation of the concept, tailored to someone without a technical background, and include a relevant example from the specified industry to help illustrate the concept's application and importance in that context."

Convince leadership to invest in tooling

Prompt:

"Compose a persuasive message to leadership advocating for the investment in a [tool], outlining the reasons for the investment, who will benefit, the estimated cost, and the expected impact on the organization.”

Example:

GPT Prompts for Data Science

Data Exploration and Preprocessing
Data scientists explore datasets to understand their structure, patterns, and potential issues. They preprocess data by cleaning, transforming, and aggregating it to prepare it for analysis.

Example: A data scientist working for an e-commerce company might explore customer purchase data, clean missing values, and aggregate it by product categories to analyze sales trends.
Suggest Data

Prompt:
"I am working on a project to build a predictive model for [insert specific problem or domain] and would like to showcase my expertise in [insert specific skills or techniques]. Can you recommend the top five datasets that would be most suitable for my use case, allowing me to effectively demonstrate my knowledge and skills?”

Explore Data

Prompt:

I want you to act as a data scientist and code for me. I have a dataset of [describe dataset]. Please write code for data visualisation and exploration.

Write a Regex

Prompt:
I want you to act as a coder. Please write me a regex in python that [describe regex]

Complete SQL Code

Prompt:
I'm working on a SQL task that involves creating a series of similar tables for different months. Each table should have the same structure, but the table names should include the month and year. The structure of each table is as follows:

  • id (integer, primary key)
  • name (varchar)
  • amount (decimal)
  • date (date)

I need to create tables for the months of January, February, and March 2023. The table names should be in the format "sales_YYYY_MM" (e.g., "sales_2023_01" for January 2023). I find this task a bit repetitive and boring, so I'm hoping you can help me generate the SQL code to create these tables. Thanks!

Address Imbalanced Data

Prompt:
"I want you to act as a coder. I have trained a machine learning model on an imbalanced dataset. The predictor variable is the column [Insert column name]. In Python, how do I oversample and/or under sample my data?"

Create a Sankey Diagram

Prompt:
Please help me create a Sankey diagram with the following information:

  • Number of stages or categories: [number_of_stages]
  • Stage names: [stage_1_name], [stage_2_name], ..., [stage_n_name]
  • Connections between stages and their flow quantities:
  • From [stage_name] to [stage_name]: [quantity]
  • From [stage_name] to [stage_name]: [quantity]
  • ...
  • From [stage_name] to [stage_name]: [quantity]

Thank you!
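
The generated code usually relies on Plotly; a minimal sketch with hypothetical stage names and flow quantities:

```python
import plotly.graph_objects as go

labels = ["Visits", "Signups", "Paid", "Churned"]  # hypothetical stages

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 1, 1],        # indices into `labels`
        target=[1, 2, 3],
        value=[1000, 200, 800],  # flow quantities between stages
    ),
))
fig.show()
```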

Feature Engineering and Selection
They create relevant features from raw data that can help improve the performance of machine learning models. They also select the most important features to reduce complexity and improve model interpretability.

Example: In a credit risk assessment project, a data scientist might create features such as debt-to-income ratio and credit utilization, and then use feature selection techniques to identify the most predictive features for a credit scoring model.
Train Classification Model

Prompt:
I want you to act as a data scientist and code for me. I have a dataset of [describe dataset]. Please build a machine learning model that predicts [target variable].

Get Feature Importance

Prompt:

"As a data scientist, I have trained a decision tree model using [insert model details here, e.g., dataset, libraries, and settings]. Can you help me understand the results of this model and provide Python code to identify the most important features?”

Tune Hyperparameters

Prompt:

I want you to act as a data scientist and code for me. I have trained a [model name]. Please write the code to tune the hyperparameters.

Get Data Set Structure

Prompt:
I have a [insert dataset type] dataset: [copy dataset sample]. Can you describe this dataset? I want to reuse this description in another ChatGPT prompt later on. Make sure you extract it in a structured format:
- table name
- list of columns
- 3 associated business tags
- the first 5 lines as a data sample

Train Time Series

Prompt:
I want you to act as a data scientist and code for me. I have a time series dataset [describe dataset]. Please build a machine learning model that predicts [target variable]. Please use [time range] as train and [time range] as validation.

Deployment and Maintenance
Data scientists collaborate with engineers to deploy their models into production and monitor their performance, making adjustments as needed to ensure continued accuracy and relevance.

Example: A data scientist working on a fraud detection system for a bank might deploy their model using a REST API, then continually monitor its performance, updating the model as new fraud patterns emerge.
Compare Function Speed

Prompt:

I want you to act as a software developer. I would like to compare the efficiency of two algorithms that perform the same task in Python. Please write code that helps me run an experiment that can be repeated 5 times. Please output the runtime and other summary statistics of the experiment. [Insert functions]
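
A sketch of the kind of timing harness you might get back, assuming both functions take the same input (the two functions below are stand-ins):

```python
import statistics
import timeit

# Stand-ins for the two functions you want to compare
def algo_a(data):
    return sorted(data)

def algo_b(data):
    return sorted(data, reverse=True)[::-1]

data = list(range(10_000, 0, -1))

for name, fn in [("algo_a", algo_a), ("algo_b", algo_b)]:
    runs = timeit.repeat(lambda: fn(data), number=100, repeat=5)  # 5 repetitions of 100 calls each
    print(f"{name}: mean={statistics.mean(runs):.4f}s stdev={statistics.stdev(runs):.4f}s")
```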

Improve Codebase Readability

Prompt:

I want you to act as a code analyzer. Can you improve the following code for readability and maintainability? [Insert code]

Enforce Pandas Test

Prompt:

I want you to act as a data scientist. Please write code to test that my pandas DataFrame [insert requirements here].

Write Unit Test

Prompt:

I want you to act as a software developer. Please write unit tests for the function [Insert function]. The test cases are: [Insert test cases]

Analyse Complexity

Prompt:

I want you to act as a software developer. Please compare the time complexity of the two algorithms below. [Insert two functions]

Debug

Python

Prompt:

I want you to act as a software developer. This code is supposed to [expected function]. Please help me debug this Python code that cannot be run. [Insert function]


SQL

Prompt:

I want you to act as a SQL code corrector. This code does not run in [your DBMS, e.g. PostgreSQL]. Can you correct it for me? [SQL code here]

Model Development and Evaluation
Data scientists build, train, and evaluate machine learning models to make predictions or uncover patterns in data.

Example: A data scientist at a streaming service might develop a recommendation engine using collaborative filtering or matrix factorization to provide personalized content recommendations to users.
Naive Bayes Hypertuning

Prompt:

"Please help me with using a naive Bayes approach for hyperparameter tuning in Databricks:

  1. Dataset: [Provide details about the dataset, including its location, format, and features]
  2. Problem: [Specify the problem you are trying to solve, such as classification or regression]
  3. Hyperparameters: [List the hyperparameters you want to tune, such as learning rate, number of iterations, or regularization parameters]
  4. Search space: [Define the search space for each hyperparameter, e.g., ranges or specific values to be explored]
  5. Evaluation metric: [Mention the evaluation metric to be used for comparing model performance, such as accuracy, F1 score, or mean squared error]

Please provide step-by-step instructions on how to perform hyperparameter tuning using a naive Bayes approach in Databricks, including any required code snippets and best practices."

Automatic Machine Learning

Prompt:

"As a data scientist, I have trained a decision tree model using [insert model details here, e.g., dataset, libraries, and settings]. Can you help me understand the results of this model and provide Python code to identify the most important features?”

GPT Prompts for Data Analyst

Data Collection and Cleaning
Data analysts gather data from various sources, such as databases, APIs, or spreadsheets, and clean it to ensure accuracy and consistency.

Example: A data analyst at a healthcare organization might collect patient data from different hospital departments, clean and standardize it to ensure consistent formatting, and merge it into a single dataset for analysis.
Generate Data

Prompt:
I want you to act as a fake data generator. I need a dataset that has [x] rows and [y] columns: [insert column names]

Output:

Generate Data From DDL

Prompt:

"Please help me generate sample data for the following SQL DDL table definition:

SQL DDL:[Provide your SQL DDL table definition, including table name, column names, and data types]

Based on the table definition, please generate a set of somewhat realistic sample data that can be used for testing and mock data generation. Ensure that the sample data is consistent with the meaning of the column names and adheres to the specified data types."

Design Panda functions

Prompt:

"Please help me perform a specific operation (x) on the following example DataFrame represented as a table in Markdown format:

[Insert Example DataFrame]

Operation (x): [Describe the desired operation, e.g., filter rows based on a condition, calculate a new column, sort the DataFrame, or group by a specific column]

Please provide the necessary Pandas code to perform the specified operation (x) on this example DataFrame, and show the resulting DataFrame after the operation is applied."

Clean Dataset

Prompt:
"Please provide a Python code snippet that demonstrates how to clean and preprocess a dataset, including handling missing values, removing duplicates, and standardizing data formats. Use a sample dataset with columns 'Name,' 'Age,' 'Gender,' and 'Email' for this demonstration.”

Merged Dataset

Prompt:
"Please provide a Python code snippet that demonstrates how to merge two datasets using the Pandas library. Assume that the first dataset, 'df1,' contains columns 'ID,' 'Name,' and 'Age,' and the second dataset, 'df2,' contains columns 'ID,' 'City,' and 'Country.' Merge the two datasets on the 'ID' column, and show the resulting merged dataset.”

Build a simple data scraper

Prompt:
"Please provide a Python code snippet that demonstrates how to scrape data from the homepage of 'www.castordoc.com' using the BeautifulSoup and requests libraries. Extract and display the page title and the text content of the main headings (e.g., h1, h2) on the page. Note: Ensure that your web scraping practices comply with the website's terms of service.Store the data in a pd dataframe"

Collect Data from an API

Prompt:
"Please provide a Python code snippet that demonstrates how to collect data from a public REST API endpoint using the 'requests' library. As an example, use the following API endpoint that returns JSON data about users: 'https://jsonplaceholder.typicode.com/users'. Retrieve the data, parse the JSON response, and display the result in a readable format."

Example:
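
A minimal sketch of the kind of snippet this returns, using the public endpoint named in the prompt:

```python
import requests

resp = requests.get("https://jsonplaceholder.typicode.com/users", timeout=10)
resp.raise_for_status()

# Print a readable summary of each user record
for user in resp.json():
    print(f'{user["id"]:>3}  {user["name"]:<25} {user["email"]}')
```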

Data Exploration and Analysis
They explore datasets to understand their structure, identify patterns, trends, and relationships, and perform statistical analyses to test hypotheses.

Example: A data analyst at an e-commerce company might analyze customer purchase data to identify seasonal trends, high-performing products, and customer segments with different spending behaviors.
Explore Data

Prompt:
I want you to act as a data scientist and code for me. I have a dataset of [describe dataset]. Please write code for data visualisation and exploration.

Calculate Running Average

Prompt:

"As a data scientist, I have a table with two columns: [Insert column names]. I'd like to calculate a running average for [specify the desired value or column]. Can you provide the SQL code to accomplish this in PostgreSQL 14?”

Rewrite used queries to modify them slightly

Prompt:

"Please help me modify the following SQL query to achieve a slightly different result:

[Insert Original SQL Query]

Original Query Purpose: [Describe the purpose or goal of the original SQL query]

Desired Modification: [Explain the specific modification you want to make to the query, such as changing the filtering criteria, adding or removing columns, modifying the aggregation, or altering the sorting order]

Please provide the modified SQL query that achieves the desired result, along with an explanation of the changes made and how the new query differs from the original one."

Translate SQL Dialects

Prompt:
What is the equivalent of the FUNC1 function in BigQuery?

Compare 2 similar SQL code

Prompt:
"Please help me compare the following two similar SQL queries and explain the differences between them:

[SQL QUERY 1]

[SQL QUERY 2]

Analyze both SQL queries and provide a detailed comparison that highlights the differences in terms of structure, syntax, filtering criteria, columns selected, aggregation, and any other relevant aspects. Additionally, explain how these differences may impact the results returned by each query and any potential implications for performance or data accuracy.”

Generate SQL Query

Prompt:

“As a senior data analyst, [insert schema & data sample]. Given the above schemas and data, write a detailed and correct [insert DBMS] SQL query to answer the analytical question:

[question]

Comment the query with your logic.”

Double Check SQL Query

Prompt:

“Double check the Postgres query above for common mistakes, including:

- Remembering to add `NULLS LAST` to an ORDER BY DESC clause

- Handling case sensitivity, e.g. using ILIKE instead of LIKE

- Ensuring the join columns are correct

- Casting values to the appropriate type

Rewrite the query here if there are any mistakes. If it looks good as it is, just reproduce the original query."

Debug Query Against DB

Prompt:

[insert query from previous prompt]

The query above produced the following error:

[insert query error]

Rewrite the query with the error fixed:"

Reporting and Visualization
Data analysts create reports and visualizations to present their findings in a clear and concise manner to stakeholders, often using tools like Tableau or Power BI.

Example: A data analyst working for a marketing agency might create a dashboard displaying the performance metrics of an advertising campaign, such as impressions, click-through rates, and conversions, to help clients understand the campaign's effectiveness.
Write Pyspark Struct

Prompt:

"Please help me create PySpark StructType and StructField schema definitions for the following dataset:

Dataset columns:

  1. Column Name: [Name of the first column]
     Data Type: [Data type of the first column, e.g., StringType, IntegerType, DoubleType, etc.]
     Nullable: [True/False, indicating if the first column can contain null values]
  2. Column Name: [Name of the second column]
     Data Type: [Data type of the second column]
     Nullable: [True/False, indicating if the second column can contain null values]

[Continue with further columns as needed]

Please provide the PySpark code for creating the StructType and StructField objects that define the schema for this dataset."
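
For orientation, a minimal sketch of the schema objects this produces, with two illustrative columns:

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Illustrative columns -- replace with your own names, types, and nullability
schema = StructType([
    StructField("user_id", IntegerType(), nullable=False),
    StructField("country", StringType(), nullable=True),
])

print(schema.simpleString())  # struct<user_id:int,country:string>
```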

Choose Visualisation Method

Prompt:

”As an expert in data visualization, I need your help to choose the best visualization method for the following problem:

[PROBLEM]

Please describe the problem in detail and recommend the most appropriate visualization method to effectively communicate the information. Explain why you think this method is the best choice.

Example:

Visualise Data

Prompt:

”Write python code to visualize [metric] using [choose viz method]”

Example:
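
A minimal sketch of what the generated visualization code might look like, assuming matplotlib and a made-up monthly revenue metric:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up metric: monthly revenue in k$
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "revenue": [120, 150, 170]})

df.plot(kind="bar", x="month", y="revenue", legend=False)
plt.ylabel("Revenue (k$)")
plt.title("Monthly revenue")
plt.tight_layout()
plt.show()
```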

Explore Data

Prompt:
"[Insert data sample] Can you do visualizations & descriptive analyses to help me understand the data?"

Perform Linear Regression

Prompt:
”[insert data sample] Can you try regressions and look for patterns? Can you run regression diagnostics?”

Maintain data documentation
Data analysts are responsible for maintaining documentation of data sources, data dictionaries, and data processing steps to ensure transparency, reproducibility, and easy access to information for other team members.

Example: A data analyst working on a financial reporting project might create and maintain a data dictionary outlining the meaning and format of each column in the dataset, as well as document the data processing and transformation steps taken during the analysis.
Write documentation for functions

Prompt:

I want you to act as a software developer. Please provide documentation for func1 below. [Insert function]

Extract structure out of data sample

Prompt:

"Please help me extract the structure of the following data sample:

Data Sample:[Provide a sample of your data, either as a small dataset, a JSON snippet, or a few rows of a CSV file]

Based on this sample, please provide the inferred structure, including column names, data types, and any relationships or hierarchies that can be observed in the data. Additionally, provide any suggestions or best practices for storing and processing this data using appropriate tools and technologies."

Business Insights and Recommendations
They interpret their findings and provide data-driven insights to support decision-making and improve business processes.

Example: A data analyst at a manufacturing company might analyze production data to identify bottlenecks in the assembly line, and recommend process improvements to increase efficiency and reduce costs.
Write OKRs

Prompt:

Write OKRs for my [X]-person data team. The focus for this quarter is [X, Y, Z].

Example:

GPT Prompts for Head of Data

Data Strategy Development
ChatGPT can assist the Head of Data in developing a comprehensive data strategy by providing insights into industry trends, best practices, and innovative use cases for data-driven initiatives.

Example: A Head of Data at a retail company might consult ChatGPT for recommendations on using natural language processing techniques to analyze customer feedback and improve the customer experience.
Measure Data Team ROI

Prompt:
"Measure data team ROI. Use best practice from this article: https://www.castordoc.com/blog/how-to-measure-the-roi-of-your-data-team”

Write a Job Description

Prompt:

I am recruiting for [insert job title] to take over the following responsibilities: [insert responsibilities]. Can you draft a job description? Customize it to our company. Here’s an example of other job descriptions from our careers page: [insert other job desc]

Identify Key Metrics

Prompt:

Identify 15 key metrics for [insert industry]. Our objective for the year is to [insert strategic priority]. We are already following [X, Y, Z KPIs], please don’t add them but you can suggest complementary KPIs or ways to improve current ones.

Data Infrastructure Assessment
ChatGPT can provide guidance on evaluating and selecting appropriate data storage solutions, processing frameworks, and data pipeline tools that align with the organization's data needs and objectives.
Benchmark Tools

Prompt:
As a data engineer, I am interested in benchmarking [list tools or category] to evaluate their performance and suitability for specific use cases. My goal is to identify the best tools for [X]. Please provide a step-by-step guide on how to conduct the benchmark, including the key criteria to consider, the metrics to measure, and any best practices to follow during the benchmarking process. Additionally, if there are any widely-used benchmarking frameworks or tools that can assist in this process, please mention them as well.

Data Governance and Compliance
ChatGPT can help the Head of Data understand complex data regulations, like GDPR or CCPA, and suggest best practices for implementing data governance policies and procedures to ensure compliance.
Define GDPR & HIPAA Process

Prompt:
"Define the GDPR and HIPAA compliance processes that a data team must follow, including key principles, requirements, and best practices. Provide a step-by-step guide on how to implement and maintain a compliant data handling and processing environment, taking into account aspects such as data collection, storage, access, and processing. [add customization depending on the specific organization and data types involved].”

Summarize Data Governance Policies

Prompt:
Explain the following data privacy regulations and requirements:

[insert policy]

Make sure my 15-year-old brother can understand this.

Define Data Catalog Roll-Out Plan

Prompt:
[describe your data team]

[describe your data maturity]

[add your timeline constraints]

Can you suggest the best roll out plan for a data catalog project?

Identifying Data-Driven Opportunities
ChatGPT can help the Head of Data uncover new opportunities for leveraging data within the organization by providing examples and use cases of successful data-driven projects in similar industries.

Example: A Head of Data at a logistics company might consult ChatGPT for ideas on how to apply advanced analytics techniques, such as predictive modeling or optimization algorithms, to improve supply chain efficiency and reduce costs.
Suggest Resources to Train Team

Prompt:
I want you to act as a data science coach. I would like to train my team about [topic]. Please suggest 3 best specific resources. You can include [specify resource type]

Draft Training Outline & Speaker Notes

Prompt:
Outline an internal team training on [X]; include training objectives and outcomes.

Summarize Research Paper

Prompt:
"As an academic, please provide a simplified one-paragraph summary of the following research paper: [Insert paper title, author(s), and publication details].”

Predict Market Trends

Prompt:
How does the job of a data team change in a recession? What are the key KPIs to follow?

Team Collaboration
ChatGPT can facilitate communication between the Head of Data and other teams by providing easily understandable explanations of complex data concepts, and help in creating training materials or documentation.

Example: A Head of Data at a financial institution might use ChatGPT to generate concise explanations of machine learning algorithms for non-technical stakeholders, promoting a deeper understanding of data-driven initiatives across the organization.
Explain Python

Prompt:
I want you to act as a code explainer. What is this code doing? [Insert code]

Explain SQL

Prompt:
I want you to act as a data science instructor. Can you please explain to me what this SQL code is doing? [Insert SQL code]

Explain Google Sheet formula

Prompt:
I want you to act as a Google Sheets formula explainer. Explain the following Google Sheets command. [Insert formula]

Explain results to different audience

Level 1

Prompt: I want you to act as a data science instructor. Explain [concept] to a five-year-old.

Level 2

Prompt: I want you to act as a data science instructor. Explain [concept] to an undergraduate.

Level 3

Prompt: I want you to act as a data science instructor. Explain [concept] to a professor.

Level 4

Prompt: I want you to act as a data science instructor. Explain [concept] to a business stakeholder.

Level 5

Prompt: I want you to act as an answerer on StackOverflow. You can provide code snippets, sample tables and outputs to support your answer. [Insert technical question]

Chained Prompt to Build Data Stack Graph
1. Explain Data Infrastructure

Prompt:
Here is a list of the tools in our data stack and how we use them: [insert tools & usage]. Can you explain how these tools relate to each other in our data infrastructure?

Example:

2. Build Data Infra Mermaid Graph

Prompt:
Awesome now write a mermaid diagram code to explain these relationships

GPT Prompts for Analytics Engineer

ChatGPT can help analytics engineers develop effective data models and transformation logic by providing guidance on best practices, techniques, and tools for data modeling and transformation tasks.

Example: An analytics engineer working on customer segmentation might consult ChatGPT for suggestions on feature engineering techniques to enhance the quality of input data for clustering algorithms.
Write Jinja Macro

Prompt:

"Please help me create a Jinja macro for my dbt project:

Macro Purpose: [Describe the purpose of the macro, e.g., calculate the age of users, create a timestamp, or format a currency value]

Input Parameters: [List the input parameters required for the macro, including their names and data types]

Expected Output: [Describe the expected output of the macro, including its data type and any specific formatting requirements]

Please provide the Jinja macro code that meets the requirements and can be used in my dbt project, along with an example of how to use the macro in a dbt model SQL file."

Add Runtime Session

Prompt:

"Please help me add a runtime session setting to a model in my dbt project:

Model Name: [Provide the name of the model you want to apply the runtime session setting to]

Session Setting: [Specify the session setting you want to apply, e.g., setting a specific database schema, changing the statement timeout, or adjusting the query priority]

Please provide step-by-step instructions on how to apply the desired runtime session setting to the specified model in my dbt project, including any required code snippets and best practices for implementing session settings in dbt."

dbt model config

Prompt:

"Write a dbt model configuration for [use case], including necessary configuration settings such as materialization, schema tests, and any other relevant configurations to optimize the model for the given use case. Make sure to include placeholders where customization is needed.”

Convert SQL into dbt model

Prompt:

Convert this SQL code: [insert code] into a dbt model. Make sure you include necessary configuration settings such as materialization, schema tests, and any other relevant configurations to optimize the model for the given use case.

Syntax & function guidance

Prompt:

"Provide detailed explanations and examples of common dbt syntax and functions, focusing on their usage in analytics engineering projects. Include explanations of key concepts such as ref(), source(), materializations, incremental models, and schema tests. Make sure to cover both basic and advanced functions, as well as any relevant tips and best practices for their effective application.”

dbt Models / Query Optimization
ChatGPT can assist analytics engineers in optimizing SQL queries for better performance, by providing tips and best practices for writing efficient queries, indexing strategies, and partitioning techniques.

Example: An analytics engineer struggling with slow query performance might ask ChatGPT for recommendations on how to optimize a specific SQL query to reduce its execution time.
Data Modeling Questions

Ask it general questions about data modeling. The key here, compared to Google or Stack Overflow, is that you can ask follow-up questions and request examples.

Prompt:

"How can I design a data model for [YOUR USE CASE] that takes into [DATA POINTS]? Please provide insights on entities, attributes, and relationships."

List Best Practices

Prompt:

"Share dbt best practices for analytics engineers, including but not limited to using incremental materializations, adopting proper naming conventions, and organizing projects with packages. Provide explanations, examples, and tips to ensure that analytics engineers are following industry standards and optimizing their work in dbt.”

Use dbt snapshots for data versioning

Prompt:

"Explain the process of using dbt snapshots for data versioning, including the benefits, key concepts, and configuration options. Provide a step-by-step guide on how to create, configure, and manage snapshots in a dbt project, along with best practices for using snapshots effectively.

[Make sure to include placeholders for customization depending on the specific use case or dataset.]”

Integrate Airflow in dbt workflow

Prompt:

"Explain the process of integrating dbt workflows with orchestration tools like Apache Airflow, including the benefits, key concepts, and best practices. Provide a step-by-step guide on how to set up and configure the integration between dbt and Apache Airflow, including creating DAGs, tasks, and any necessary scripts or configurations. Make sure to include placeholders for customization depending on the specific project requirements and use case.”

Data Validation and Quality Assurance
ChatGPT can help analytics engineers implement robust data validation and quality checks by providing examples of data validation techniques, data quality metrics, and monitoring tools.

Example: An analytics engineer building a data pipeline for sales data might use ChatGPT to obtain recommendations on automated data validation processes to ensure data accuracy and consistency.
Data Validation Process

Prompt:

"Propose a comprehensive data validation process for a dbt pipeline, including key steps, methodologies, and best practices. Cover aspects such as schema tests, custom data tests, using dbt assertions, and any relevant third-party tools or packages. Provide a step-by-step guide on how to implement and maintain an effective data validation process, making sure to include placeholders for customization depending on the specific project requirements and use case.”

Data Quality Test Procedure

Prompt:

What is the process to add data quality tests to a dbt model?

Data Quality Test Code

Prompt:

”Write data quality test for the following dbt model: [insert code]”

dbt Docs Generator
ChatGPT can support analytics engineers in creating clear and concise documentation of data models, transformation logic, and data pipeline processes to facilitate collaboration and knowledge sharing among team members.

Example: An analytics engineer might use ChatGPT to generate a clear and concise explanation of a complex data transformation process, making it easier for other team members to understand and maintain the pipeline.
Add description to dbt model

Prompt:

Explain this dbt model: [Insert Model]. Structure the answer in the following format:
- a one-line title about the model
- a step-by-step explanation of how the model works

Comment dbt code

Prompt:

[Insert dbt Code] Act as an analytics engineer & add inline comments to explain the most important parts of the code. Be concise.

Document dbt schemas

Prompt:

[Insert dbt schema] Act as an analytics engineer & describe the schema above.

Identify Gap in Documentation

Prompt:

[Insert dbt code] Identify the gaps in the documentation of this dbt code. Make suggestions to improve it.

Batch Document Columns - High Quality

Prompt:

[insert data sample]

[insert dbt model]

Please document this data table based on the column values & dbt model.

Explain dbt model

Prompt:

Explain this dbt model in simple terms that a business user can understand: [Insert model].

Exploratory Data Analysis
ChatGPT can assist analytics engineers in conducting exploratory data analysis by suggesting statistical techniques, visualizations, and tools to identify patterns, trends, and relationships in data.

Example: An analytics engineer analyzing user engagement data from a mobile app might consult ChatGPT for ideas on which visualizations and statistical tests would be most effective in uncovering insights about user behavior.
Suggest Statistical technique

Prompt:

I want to do [X] with the following data: [insert data]. Can you suggest statistical techniques that will help me do [X]? Provide a SQL code sample if possible.

Missing Data Ideas

Prompt:

If I am missing [X] data, what is the best way to measure [X]?

Find Best Visualisation Ideas

Prompt:

I want to do [X] with the following data: [insert data]. Can you suggest the best data visualisation idea for my use case?

GPT Prompts for Business Analyst (COMING SOON)

GPT Prompts for Data Architect (COMING SOON)