GPT Prompts for Data Teams
🤖


Most of the prompts below are real-life prompts suggested by readers, used internally by our team at Castor, found on Reddit, or gathered from conversations at data events. Not all of them will be relevant to you, but the goal is to inspire you to try ChatGPT to boost your productivity.

💡
How to use this GPT Prompts guide?
  1. Explore the sections that are interesting to you.
  2. Do not copy/paste any proprietary data into ChatGPT. This can be detrimental to your company. Always generate fake data or make sure what you are sending is non-sensitive.
  3. GPT-4 will give you a much better experience with coding scripts than GPT-3. It is still not perfect, but it will keep improving over time.
  4. Customize GPT to yourself before asking it anything. Write a quick 10-line presentation about yourself, what you care about, and what your goals are. This will drastically improve the output of the following prompts.
  5. You will usually need to tweak the code ChatGPT gives back, but it gets you 90% of the way there.

Want to help?

➡️ Give feedback in the chat, in the bottom right corner.

➡️ Share it so more data teams can increase their productivity with ChatGPT.

🏗️
GPT Prompts for Data Engineering
Data Pipeline Development

Design, build, and maintain data pipelines to ingest, clean, transform, and store data from various sources into data storage systems, such as data warehouses or data lakes, ensuring data is available for analytics and machine learning tasks.

Prompts

Generate Data
Prompt: I want you to act as a fake data generator. I need a dataset that has [x] rows and [y] columns: [insert column names]
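
If you would rather not send anything to ChatGPT at all, fake data can also be generated locally. The sketch below is a minimal example using only the Python standard library. The column names (`customer_id`, `country`, `amount`), the value ranges, and both function names are illustrative assumptions, not part of the prompt above.

```python
import csv
import io
import random


def generate_fake_rows(n_rows, seed=0):
    """Return n_rows dicts of made-up data (hypothetical columns)."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    countries = ["USA", "France", "Japan", "Brazil"]
    return [
        {
            "customer_id": i + 1,
            "country": rng.choice(countries),
            "amount": round(rng.uniform(5, 500), 2),
        }
        for i in range(n_rows)
    ]


def rows_to_csv(rows):
    """Serialize a non-empty list of row dicts to CSV text, header included."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `rows_to_csv(generate_fake_rows(100))` returns CSV text that can be saved to a file or pasted into a prompt without exposing real records.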
Generate Data From DDL
Prompt:

"Please help me generate sample data for the following SQL DDL table definition:

SQL DDL: [Provide your SQL DDL table definition, including table name, column names, and data types]

Based on the table definition, please generate a set of somewhat realistic sample data that can be used for testing and mock data generation. Ensure that the sample data is consistent with the meaning of the column names and adheres to the specified data types."

Join Datasets
Prompt: I want you to act as a data engineer and code in Python for me. I have two datasets, A and B. A is [explain A structure]. B is [explain B structure]. I need to join them on a foreign key [enter FK].
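
To make the request concrete, here is a minimal sketch of a join on a foreign key, written with plain Python dictionaries rather than any particular library. The function name `inner_join` and the `orders`/`customers` sample rows are hypothetical stand-ins for datasets A and B.

```python
def inner_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on a shared key column."""
    # Index the right side by key so each lookup is a dict access.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        # Rows without a match on the right side are dropped (inner join).
        for right in index.get(left[key], []):
            # On a column name clash, the value from the left row wins.
            joined.append({**right, **left})
    return joined


# Hypothetical sample data: A = orders, B = customers, FK = customer_id.
orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 99}]
customers = [{"customer_id": 10, "name": "Ada"}]
matched = inner_join(orders, customers, "customer_id")
```

Order 2 has no matching customer, so only order 1 appears in `matched`. Whether unmatched rows should be kept (a left join) is worth stating explicitly in the prompt.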
Create CSV→BigQuery Pipeline
Prompt: "Act as a senior data engineer & provide a Python code sample demonstrating data engineering best practices to move data from a CSV file to BigQuery. Use the standard library when possible, but feel free to use external libraries if they significantly improve the process.”

Invert Dependency Tree

I had a list of immediate dependencies for jobs and wanted to reverse it to find upstream sources.

Prompt: In Python I have a dependency tree in a dict. Write a script to invert that dependency tree.
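
A minimal sketch of the kind of script this prompt asks for, assuming the dict maps each job name to a list of its immediate dependencies. The function name and the sample job names are hypothetical.

```python
from collections import defaultdict


def invert_dependencies(deps):
    """Map each job to the sorted list of jobs that depend on it directly."""
    inverted = defaultdict(set)
    for job, upstreams in deps.items():
        inverted.setdefault(job, set())  # keep jobs that nothing depends on
        for upstream in upstreams:
            inverted[upstream].add(job)
    return {job: sorted(children) for job, children in inverted.items()}


# Hypothetical pipeline: report depends on clean, clean depends on raw.
deps = {"report": ["clean"], "clean": ["raw"], "raw": []}
inverted = invert_dependencies(deps)
```

Here `inverted["raw"]` is `["clean"]` and `inverted["report"]` is an empty list, since no job depends on the report.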

Write Airflow DAG
Prompt: I'm working on a data pipeline using Apache Airflow, and I need to create a DAG that performs the following tasks in sequence:
  1. Extract data from an API and save it to a CSV file.
  2. Load the CSV data into a PostgreSQL database.
  3. Run an SQL query on the database to aggregate the data and generate a report.
  4. Email the report to a list of recipients.

Can you help me write an Airflow DAG that accomplishes these tasks? Please include comments explaining each part of the DAG, and assume that I have the necessary Python functions to perform the data extraction, loading, querying, and emailing tasks.

Data Integration

Integrate data from disparate sources, such as APIs, databases, or files, to create a unified view of the data. This often involves understanding different data formats, schemas, and dealing with inconsistencies or missing data.

Prompts

Regex Writing
Prompt: "Help me solve this regex problem: I need to create a regular expression pattern that matches [specific requirement]. Can you provide a regex pattern and explain how it works?”
SQL Troubleshooting
Prompt:

"Please help me identify any issues or potential problems in the following SQL code:

[Insert SQL Code]

Analyze the provided SQL code and point out any syntax errors, logical issues, performance concerns, or best practice violations that may be present. Additionally, suggest possible improvements or fixes for the identified problems."

Generate Mermaid Diagram
Prompt:

"Please help me create a Mermaid diagram based on the following description:

Diagram type: [Specify the type of Mermaid diagram, such as flowchart, Gantt, or sequence diagram]

Description:

  • Elements: [List the elements of the diagram, including their names, types, and relationships]
  • Relationships: [Describe the relationships between elements, such as connections, dependencies, or hierarchies]
  • Layout: [Specify any preferred layout or arrangement of the diagram elements, if applicable]
  • Styling: [Mention any desired styling options, such as colors, fonts, or line styles]

Please provide the Mermaid code to generate the described diagram."

Translate Code Between DBMS
Prompt: I want you to act as a coder and write SQL code for [DBMS 1]. What is the equivalent of [DBMS 2]'s DATE_TRUNC in [DBMS 1]?

Data Storage & Management

Design and manage data storage solutions, such as relational databases, NoSQL databases, or distributed file systems, to ensure data is organized, accessible, and scalable. This includes tasks like schema design, indexing, and partitioning.

Prompts

Create a persistent Hive Table
Prompt: How do I make a Hive table persist in PySpark?
Create Stacks for AWS CloudFormation
Prompt:

"Please help me create an AWS CloudFormation stack using the AWS Cloud Development Kit (CDK) in Python:

AWS Services to Include: [List the AWS services you want to include in the stack, e.g., EC2, S3, Lambda, RDS, etc.]

Stack Requirements: [Provide specific requirements for the stack, such as the desired instance types, number of instances, storage capacities, or any other configuration details]

Please provide step-by-step instructions and Python code for creating the CloudFormation stack using the AWS CDK, along with any necessary prerequisites, imports, and dependencies. Additionally, include any tips or best practices for working with the AWS CDK and CloudFormation in Python."

Data Quality & Monitoring

Implement data quality checks, validation rules, and monitoring systems to ensure the accuracy, consistency, and reliability of the data. Identify and resolve data quality issues, and set up alerts to notify the team of any issues.

Prompts

Suggest Edge Cases
Prompt: I want you to act as a software developer. Please help me catch edge cases for this function [insert function]
Suggest Data Quality Tests
Prompt: I want you to act as a software developer. Please suggest data quality tests for this data pipeline [insert code]

Performance Optimisation

Optimize data processing tasks, queries, and storage systems to improve performance and reduce latency. This may involve tuning database configurations, optimizing SQL queries, or leveraging big data processing frameworks such as Apache Spark.

Prompts

Chain Optimization for a SQL query
Prompt 1: I have a SQL query that I'd like to optimize. Here's the query: SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id WHERE customers.country = 'USA';

Can you help me identify any potential performance issues with this query?

Prompt 2: Thank you for the feedback. I've checked and there are no indexes on the 'customer_id' column in either the 'orders' or the 'customers' table, and there's no index on the 'country' column in the 'customers' table. Should I create any indexes to improve the query's performance? If so, which columns should I index?

Prompt 3: I see, that makes sense. I also noticed that I'm using 'SELECT *' in the query, which selects all columns from both tables. However, I only need a few specific columns from each table. How should I modify the query to select only the columns I need, and will this improve performance?

Prompt 4: Thanks for the advice. I'm also concerned about the number of rows returned by the query. It could potentially return a large number of rows. Is there a way to paginate the results so that I only retrieve a limited number of rows at a time? How can I implement pagination in the query?

Prompt 5: I've made the suggested changes to the query, and it's performing better now. However, I'd like to further analyze the query's execution plan to see if there are any additional optimizations I can make. How can I view the execution plan for the query, and what should I look for in the plan to identify potential performance improvements?

Prompt 6: I've reviewed the execution plan and made some additional optimizations based on your recommendations. Is there anything else I should consider to further optimize the query? Are there any best practices I should follow to ensure the query remains efficient as the database grows?
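
The changes discussed across this chain (indexes, explicit columns, pagination, execution plan) can be tried end to end on a toy database. The sketch below uses Python's built-in `sqlite3` module purely for illustration. The table contents, index names, selected columns, and helper functions are assumptions, and the execution-plan syntax and output differ between database engines.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory version of the orders/customers tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        -- Prompt 2: index the join key and the filter column.
        CREATE INDEX idx_orders_customer_id ON orders (customer_id);
        CREATE INDEX idx_customers_country ON customers (country);
        """
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Ada", "USA"), (2, "Luc", "France"), (3, "Bob", "USA")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(i, 1 + i % 3, 10.0 * i) for i in range(1, 10)],
    )
    conn.commit()
    return conn


# Prompt 3: name only the needed columns. Prompt 4: paginate with LIMIT/OFFSET.
PAGED_QUERY = """
    SELECT o.order_id, o.amount, c.name
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    WHERE c.country = 'USA'
    ORDER BY o.order_id
    LIMIT ? OFFSET ?
"""


def fetch_page(conn, page, page_size=2):
    """Return one page of results (pages are numbered from 0)."""
    return conn.execute(PAGED_QUERY, (page_size, page * page_size)).fetchall()


def explain(conn):
    """Prompt 5: return SQLite's query plan as a list of step descriptions."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + PAGED_QUERY, (2, 0)).fetchall()
    return [row[-1] for row in rows]
```

OFFSET-based pagination is the simplest form to show. On large tables, paging on the last seen key (`WHERE order_id > ?`) is a common alternative worth asking ChatGPT about.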

Spark Optimization Ideas
Prompt: I'm working with Apache Spark to process large datasets, and I'm looking for ways to optimize the performance of my Spark jobs. Specifically, I'm interested in improving the execution time, reducing the memory footprint, and minimizing data shuffling. Can you provide me with some practical ideas and best practices for optimizing Spark jobs? Additionally, if you have any tips for tuning Spark configurations, I'd appreciate hearing them.
Optimize Pandas
Prompt: I want you to act as a code optimizer. Can you point out what's wrong with the following Pandas code and optimize it? [Insert code here]
Optimize Code Perf
Prompt: I want you to act as a software developer. Please help me improve the time complexity of the code below. [Insert code]
Optimize Python
Prompt: I want you to act as a code optimizer. The code is poorly written. How do I correct it? [Insert code here]
Simplify Python
Prompt: I want you to act as a Python code simplifier. Can you simplify the following code? [Insert code here]
Communicate Optimization ROI

Prompt: "Assume you are a data engineer who has optimized data pipeline processes, code performance, or SQL query output. The results are the following: [provide ROI & metrics about the optimisation]. Provide a non-technical explanation highlighting the importance and benefits of these optimizations for business stakeholders, and how they can contribute to the overall success of the company.

Structure the output in 3 bullet points and less than 150 words. Keep it data driven."


💡
GPT Prompts for Data Science
Data Exploration and Preprocessing

Data scientists explore datasets to understand their structure, patterns, and potential issues. They preprocess data by cleaning, transforming, and aggregating it to prepare it for analysis.

Example: A data scientist working for an e-commerce company might explore customer purchase data, clean missing values, and aggregate it by product categories to analyze sales trends.

Prompts

Suggest Data
Prompt: "I am working on a project to build a predictive model for [insert specific problem or domain] and would like to showcase my expertise in [insert specific skills or techniques]. Can you recommend the top five datasets that would be most suitable for my use case, allowing me to effectively demonstrate my knowledge and skills?”
Explore Data
Prompt: I want you to act as a data scientist and code for me. I have a dataset of [describe dataset]. Please write code for data visualisation and exploration.
Write a Regex
Prompt: I want you to act as a coder. Please write me a regex in python that [describe regex]
Complete SQL Code
Prompt: I'm working on a SQL task that involves creating a series of similar tables for different months. Each table should have the same structure, but the table names should include the month and year. The structure of each table is as follows:
  • id (integer, primary key)
  • name (varchar)
  • amount (decimal)
  • date (date)

I need to create tables for the months of January, February, and March 2023. The table names should be in the format "sales_YYYY_MM" (e.g., "sales_2023_01" for January 2023). I find this task a bit repetitive and boring, so I'm hoping you can help me generate the SQL code to create these tables. Thanks!
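
For repetitive DDL like this, a short script is another option. The sketch below generates the three statements from a template. The `VARCHAR(255)` length and `DECIMAL(10, 2)` precision are assumptions, since the prompt only names the types, and the function name is hypothetical.

```python
def monthly_sales_ddl(year, months):
    """Build one CREATE TABLE statement per month, named sales_YYYY_MM."""
    template = (
        "CREATE TABLE sales_{year}_{month:02d} (\n"
        "    id INTEGER PRIMARY KEY,\n"
        "    name VARCHAR(255),\n"      # length is an assumption
        "    amount DECIMAL(10, 2),\n"  # precision/scale are assumptions
        "    date DATE\n"
        ");"
    )
    return [template.format(year=year, month=month) for month in months]


# January, February, and March 2023.
statements = monthly_sales_ddl(2023, [1, 2, 3])
```

The `{month:02d}` format pads single-digit months with a zero, which gives the `sales_2023_01` naming the prompt asks for.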

Address Imbalanced Data
Prompt: I want you to act as a coder. I have trained a machine learning model on an imbalanced dataset. The target variable is the column [Insert column name]. In Python, how do I oversample and/or undersample my data?
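
As a reference point for what an answer might contain, here is a minimal sketch of random over- and undersampling using only the standard library. The function name `resample`, the `mode` argument, and the `label` column in the usage lines are hypothetical. Dedicated libraries offer more sophisticated techniques, and resampling is normally applied to the training split only.

```python
import random
from collections import defaultdict


def resample(rows, target, mode="over", seed=0):
    """Balance classes by random oversampling ("over") or undersampling ("under")."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    sizes = [len(group) for group in by_class.values()]
    # Oversampling grows every class to the largest one;
    # undersampling shrinks every class to the smallest one.
    size = max(sizes) if mode == "over" else min(sizes)

    balanced = []
    for group in by_class.values():
        if len(group) >= size:
            # Draw without replacement (keeps all rows when sizes match).
            balanced.extend(rng.sample(group, size))
        else:
            # Keep every row, then duplicate randomly chosen rows.
            balanced.extend(group + rng.choices(group, k=size - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical dataset: 8 rows of class 0 and 2 rows of class 1.
rows = [{"id": i, "label": 0} for i in range(8)]
rows += [{"id": 100 + i, "label": 1} for i in range(2)]
oversampled = resample(rows, "label", mode="over")
undersampled = resample(rows, "label", mode="under")
```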

Create a Sankey Diagram
Prompt:

Please help me create a Sankey diagram with the following information:

  • Number of stages or categories: [number_of_stages]
  • Stage names: [stage_1_name], [stage_2_name], ..., [stage_n_name]
  • Connections between stages and their flow quantities:
    • From [stage_name] to [stage_name]: [quantity]
    • From [stage_name] to [stage_name]: [quantity] ...
    • From [stage_name] to [stage_name]: [quantity]

Thank you!

Feature Engineering and Selection

Data scientists create relevant features from raw data that can help improve the performance of machine learning models. They also select the most important features to reduce complexity and improve model interpretability. Example: In a credit risk assessment project, a data scientist might create features such as debt-to-income ratio and credit utilization, and then use feature selection techniques to identify the most predictive features for a credit scoring model.

Prompts

Train Classification Model
Prompt: I want you to act as a data scientist and code for me. I have a dataset of [describe dataset]. Please build a machine learning model that predicts [target variable].
Get Feature Importance
Prompt: "As a data scientist, I have trained a decision tree model using [insert model details here, e.g., dataset, libraries, and settings]. Can you help me understand the results of this model and provide Python code to identify the most important features?”
Tune Hyperparameters
Prompt: I want you to act as a data scientist and code for me. I have trained a [model name]. Please write the code to tune the hyperparameters.
Get Data Set Structure
Prompt: I have a [insert dataset type] dataset: [copy dataset sample]. Can you describe this dataset? I want to reuse this description in another ChatGPT prompt later on. Make sure you extract the following in a structured format:
  • table name
  • list of columns
  • 3 associated business tags
  • first 5 lines as a data sample
Train Time Series
Prompt: I want you to act as a data scientist and code for me. I have a time series dataset [describe dataset]. Please build a machine learning model that predicts [target variable]. Please use [time range] as train and [time range] as validation.

Model Development and Evaluation

Data scientists build, train, and evaluate machine learning models to make predictions or uncover patterns in data. Example: A data scientist at a streaming service might develop a recommendation engine using collaborative filtering or matrix factorization to provide personalized content recommendations to users.

Prompts

Bayesian Hypertuning
Prompt:

"Please help me with using a Bayesian optimization approach for hyperparameter tuning in Databricks:

  1. Dataset: [Provide details about the dataset, including its location, format, and features]
  2. Problem: [Specify the problem you are trying to solve, such as classification or regression]
  3. Hyperparameters: [List the hyperparameters you want to tune, such as learning rate, number of iterations, or regularization parameters]
  4. Search space: [Define the search space for each hyperparameter, e.g., ranges or specific values to be explored]
  5. Evaluation metric: [Mention the evaluation metric to be used for comparing model performance, such as accuracy, F1 score, or mean squared error]

Please provide step-by-step instructions on how to perform hyperparameter tuning using a Bayesian optimization approach in Databricks, including any required code snippets and best practices."

Automatic Machine Learning
Prompt: I want you to act as an automatic machine learning (AutoML) bot using TPOT for me. I am working on a model that predicts [...]. Please write Python code to find the best classification model with the highest AUC score on the test set.

Interpretation and Visualization

Data scientists interpret the results of their analyses and models, and create visualizations to communicate their findings to stakeholders in an easily understandable manner. Example: A data scientist analyzing marketing campaign data might create visualizations of customer segments, conversion rates, and revenue generated, and present these findings to the marketing team to inform future campaign strategies.

Prompts

Write PySpark Struct
Prompt:

"Please help me create PySpark StructType and StructField schema definitions for the following dataset:

[Provide Data Sample]

Dataset columns:

  1. Column Name: [Name of the first column] Data Type: [Data type of the first column, e.g., StringType, IntegerType, DoubleType, etc.] Nullable: [True/False, indicating if the first column can contain null values]
  2. Column Name: [Name of the second column] Data Type: [Data type of the second column] Nullable: [True/False, indicating if the second column can contain null values]

[Continue with further columns as needed]

Please provide the PySpark code for creating the StructType and StructField objects that define the schema for this dataset."

A/B Testing
Prompt: I want you to act as a statistician. [Describe context] Please design an A/B test for this purpose. Please include the concrete steps on which statistical test I should run.
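
Which test fits depends on the context you describe. As one illustration, when the metric is a conversion rate, a common choice is a two-sided two-proportion z-test. The sketch below is a minimal standard-library version; the function name and the sample counts in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / conv_b are conversion counts, n_a / n_b are visitor counts.
    Returns (z statistic, p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 100/1000 conversions for A, 150/1000 for B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

The normal approximation behind this test assumes reasonably large samples in both groups. Sample size and significance level should be fixed before the experiment starts, which is a good detail to include in the prompt.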

Deployment and Maintenance

Data scientists collaborate with engineers to deploy their models into production and monitor their performance, making adjustments as needed to ensure continued accuracy and relevance. Example: A data scientist working on a fraud detection system for a bank might deploy their model using a REST API, then continually monitor its performance, updating the model as new fraud patterns emerge.

Prompts

Compare Function Speed
Prompt: I want you to act as a software developer. I would like to compare the efficiency of two algorithms that perform the same task in Python. Please write code that helps me run an experiment that can be repeated 5 times. Please output the runtime and other summary statistics of the experiment. [Insert functions]
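
A minimal sketch of such an experiment, built on the standard `timeit` and `statistics` modules. The helper name `compare_runtimes` and the two sum-of-squares functions are hypothetical placeholders for your own functions.

```python
import statistics
import timeit


def compare_runtimes(funcs, repeats=5, number=1000):
    """Time each zero-argument function `repeats` times and summarize the runs."""
    summary = {}
    for func in funcs:
        # Each run calls func() `number` times and reports total seconds.
        runs = timeit.repeat(func, repeat=repeats, number=number)
        summary[func.__name__] = {
            "min": min(runs),
            "mean": statistics.mean(runs),
            "stdev": statistics.stdev(runs),
        }
    return summary


# Hypothetical functions under test: two ways to sum squares.
def loop_sum():
    total = 0
    for i in range(100):
        total += i * i
    return total


def builtin_sum():
    return sum(i * i for i in range(100))
```

Running `compare_runtimes([loop_sum, builtin_sum])` returns one dictionary of statistics per function. The minimum is usually the most stable number to compare, because slower runs mostly reflect interference from other processes.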
Improve Codebase Readability
Prompt: I want you to act as a code analyzer. Can you improve the following code for readability and maintainability? [Insert code]
Enforce Pandas Test
Prompt: I want you to act as a data scientist. Please write code to test that my pandas DataFrame [insert requirements here]
Write Unit Test
Prompt: I want you to act as a software developer. Please write unit tests for the function [Insert function]. The test cases are: [Insert test cases]
Analyse Complexity
Prompt: I want you to act as a software developer. Please compare the time complexity of the two algorithms below. [Insert two functions]
Debug

Python

Prompt: I want you to act as a software developer. This code is supposed to [expected function]. Please help me debug this Python code that cannot be run. [Insert function]

SQL

Prompt: I want you to act as a SQL code corrector. This code does not run in [your DBMS, e.g. PostgreSQL]. Can you correct it for me? [SQL code here]

🤯
GPT Prompts for Head of Data
Data Strategy Development

ChatGPT can assist the Head of Data in developing a comprehensive data strategy by providing insights into industry trends, best practices, and innovative use cases for data-driven initiatives. Example: A Head of Data at a retail company might consult ChatGPT for recommendations on using natural language processing techniques to analyze customer feedback and improve the customer experience.

Prompts

Measure Data Team ROI
Prompt: "Measure data team ROI. Use best practices from this article: https://www.castordoc.com/blog/how-to-measure-the-roi-of-your-data-team"

Example

To measure your data team's ROI, follow these best practices:

  1. Set SMART objectives: Ensure goals are Specific, Measurable, Achievable, Relevant, and Time-bound, aligned with organizational priorities.
  2. Identify data team costs, including salaries, software, hardware, and training.
  3. Quantify value generated, considering revenue generation, cost savings, and operational efficiency improvements.
  4. Calculate ROI as a percentage: (Value Generated / Data Team Costs) x 100.
  5. Continuously monitor and report ROI to stakeholders, demonstrating the data team's value and fostering a data-driven culture.

This approach helps assess the impact of your data team's work and ensures alignment with business objectives.
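
The formula in step 4 can be sketched in a few lines. As written, it divides total value by cost; many finance definitions of ROI subtract the cost first, so a second helper shows that variant for comparison. Both function names and the dollar figures are illustrative assumptions.

```python
def data_team_roi(value_generated, team_costs):
    """ROI percentage as written in step 4: (value / costs) x 100."""
    if team_costs <= 0:
        raise ValueError("team_costs must be positive")
    return value_generated / team_costs * 100


def net_roi(value_generated, team_costs):
    """Common alternative: net gain relative to costs, as a percentage."""
    if team_costs <= 0:
        raise ValueError("team_costs must be positive")
    return (value_generated - team_costs) / team_costs * 100


# Hypothetical year: $1.5M of value generated for $1.0M of team costs.
ratio_pct = data_team_roi(1_500_000, 1_000_000)  # 150.0
net_pct = net_roi(1_500_000, 1_000_000)          # 50.0
```

The two numbers describe the same situation, so it is worth stating which definition a report uses.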

Write a Job Description
Prompt: I am recruiting for [insert job title] to take over the following responsibilities: [insert responsibilities]. Can you draft a job description? Customize it to our company. Here are examples of other job descriptions on our careers page: [insert other job desc]

Identify Key Metrics
Prompt: Identify 15 key metrics for [insert industry]. Our objective for the year is to [insert strategic priority]. We are already following [X, Y, Z KPIs], please don’t add them but you can suggest complementary KPIs or ways to improve current ones.

Data Infrastructure Assessment

ChatGPT can provide guidance on evaluating and selecting appropriate data storage solutions, processing frameworks, and data pipeline tools that align with the organization's data needs and objectives.

Prompts

Benchmark Tools
Prompt: As a data engineer, I am interested in benchmarking [list tools or category] to evaluate their performance and suitability for specific use cases. My goal is to identify the best tools for [X]. Please provide a step-by-step guide on how to conduct the benchmark, including the key criteria to consider, the metrics to measure, and any best practices to follow during the benchmarking process. Additionally, if there are any widely-used benchmarking frameworks or tools that can assist in this process, please mention them as well.

Data Governance and Compliance

ChatGPT can help the Head of Data understand complex data regulations, like GDPR or CCPA, and suggest best practices for implementing data governance policies and procedures to ensure compliance.

Prompts

Define GDPR & HIPAA Process
Prompt: "Define the GDPR and HIPAA compliance processes that a data team must follow, including key principles, requirements, and best practices. Provide a step-by-step guide on how to implement and maintain a compliant data handling and processing environment, taking into account aspects such as data collection, storage, access, and processing. [add customization depending on the specific organization and data types involved]."

Summarize Data Governance Policies
Prompt: Explain the following data privacy regulations and requirements: [insert policy]. Make sure my 15-year-old brother can understand this.
Define Data Catalog Roll-Out Plan
Prompt: [describe your data team] [describe your data maturity] [add your timeline constraints] Can you suggest the best roll out plan for a data catalog project?

Team Collaboration

ChatGPT can facilitate communication between the Head of Data and other teams by providing easily understandable explanations of complex data concepts, and help in creating training materials or documentation. Example: A Head of Data at a financial institution might use ChatGPT to generate concise explanations of machine learning algorithms for non-technical stakeholders, promoting a deeper understanding of data-driven initiatives across the organization.

Prompts

Explain Python
Prompt: I want you to act as a code explainer. What is this code doing? [Insert code]
Explain SQL
Prompt: I want you to act as a data science instructor. Can you please explain to me what this SQL code is doing? [Insert SQL code]
Explain Google Sheet formula
Prompt: I want you to act as a Google Sheets formula explainer. Explain the following Google Sheets command. [Insert formula]
Explain results to different audiences

Level 1

Prompt: I want you to act as a data science instructor. Explain [concept] to a five-year-old.

Level 2

Prompt: I want you to act as a data science instructor. Explain [concept] to an undergraduate.

Level 3

Prompt: I want you to act as a data science instructor. Explain [concept] to a professor.

Level 4

Prompt: I want you to act as a data science instructor. Explain [concept] to a business stakeholder.

Level 5

Prompt: I want you to act as an answerer on StackOverflow. You can provide code snippets, sample tables and outputs to support your answer. [Insert technical question]

Chained Prompt to Build Data Stack Graph

1. Explain Data Infrastructure
Prompt: "You are a data engineer in a company using Fivetran, Snowflake, dbt, Looker and Castor (data catalog). Explain to a business user in your organization how all of these tools work together."
2. Build Data Infra Mermaid Graph
Chained Prompt: Awesome, now write Mermaid diagram code to explain these relationships.
Example
graph LR
    A[Fivetran] -- Data Ingestion --> B[Snowflake]
    B -- Data Storage --> C[dbt]
    C -- Data Transformation --> D[Looker]
    D -- Data Analysis & Visualization --> E[Business Users]
    B -- Data Cataloging --> F[Castor]
    F -- Metadata & Lineage --> E

Identifying Data-Driven Opportunities

ChatGPT can help the Head of Data uncover new opportunities for leveraging data within the organization by providing examples and use cases of successful data-driven projects in similar industries. Example: A Head of Data at a logistics company might consult ChatGPT for ideas on how to apply advanced analytics techniques, such as predictive modeling or optimization algorithms, to improve supply chain efficiency and reduce costs.

Prompts

Suggest Resources to Train Team
Prompt: I want you to act as a data science coach. I would like to train my team on [topic]. Please suggest the 3 best specific resources. You can include [specify resource type]
Draft Training Outline & Speaker Notes
Prompt: Outline an internal team training on [X]. Include training objectives and outcomes.
Summarize Research Paper
Prompt: "As an academic, please provide a simplified one-paragraph summary of the following research paper: [Insert paper title, author(s), and publication details].”
Predict Market Trends
Prompt: How does the job of a data team change in a recession? What are the key KPIs to follow?

💼  GPT Prompts for Business Analyst

Coming Soon
💡
GPT Prompts for Data Governance
Data Governance Strategy

Develop and implement a comprehensive data governance strategy and framework that aligns with the organization's overall business objectives. This includes defining goals, policies, procedures, and metrics to measure success.

Prompts

Getting Started with Governance
Prompt: "Assuming a team has no existing data governance framework in place, provide a step-by-step guide on how to implement data governance from scratch, prioritizing the most important aspects first.”
Define Data Governance Goals
Prompt: "Act as a data governance leader. You work in a company in [industry]. Data is strategic in your company for [X, Y, Z reasons]. Your data governance practice has existed for [number of years]. You have already successfully implemented [Project 1, 2, 3]. You need to define data governance goals for the next quarter. You want to impact [Strategic Project A & B]."
Write Access Right Policy Framework on Snowflake
Prompt: "Act as a security engineer from Snowflake. You want to write the Access Control Privileges for your company. Here are the roles & access levels I want to create: [Role 1: System Access 1, Schema Access 1, Object Access 1; Role 2: System Access 2, Schema Access 2, Object Access 2; Role 3: System Access 3, Schema Access 3, Object Access 3]"

Design a strategy “by the books” 📚

List data governance books to read
Prompt: Can you recommend books to read on data governance?
Summarize a book on data governance
Prompt:

Can you give me an in-depth summary of the following book on data governance? I am already familiar with the data governance world.

[Insert Book Title & writer]

Ask data governance strategy based on a specific book
Prompt: Design a data governance strategy for [add your industry] to [add context & use case] based on the principles in this book: [add book details]

Data Quality Measurement

Establish data quality standards, guidelines, and best practices to ensure data accuracy, consistency, and reliability. Oversee the implementation of data quality checks, validation rules, and monitoring systems to identify and resolve data quality issues.

Prompts

Improve Codebase Readability
Prompt: I want you to act as a code analyzer. Can you improve the following code for readability and maintainability? [Insert code]
Write Data Quality Tests
Prompt: "Here's a table: [insert table sample]. Can you write data quality tests in SQL/Python to make sure the output is consistent? Flag nulls & duplicates."
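
For orientation, a test that flags nulls and duplicates can be as small as the sketch below. It works on a list of row dictionaries and uses only the standard library. The function name, the choice to treat empty strings as nulls, and the sample rows are all assumptions.

```python
from collections import Counter


def check_quality(rows, key):
    """Count null values per column and list duplicated values of `key`."""
    null_counts = Counter()
    for row in rows:
        for column, value in row.items():
            # Assumption: empty strings count as missing values too.
            if value is None or value == "":
                null_counts[column] += 1

    key_counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, count in key_counts.items() if count > 1)
    return {"nulls": dict(null_counts), "duplicate_keys": duplicates}


# Hypothetical table sample with one null email and one repeated id.
sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example.com"},
]
report = check_quality(sample, key="id")
```

Here `report` flags one null in `email` and the duplicated key `2`. In a pipeline, the same checks would typically raise an error or alert instead of returning a dictionary.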
Data Quality Standards
Prompt:

"Please describe the key data quality standards you would like to establish within your company. Consider including aspects such as accuracy, completeness, consistency, timeliness, and uniqueness. For each standard, provide a brief explanation and suggest appropriate metrics or methods to measure and ensure compliance. Additionally, mention any specific industry regulations or requirements that need to be adhered to.

  1. Standard Name (e.g., Accuracy):
    • Explanation: Briefly explain the importance of this standard.
    • Measurement/Compliance Method: How will you measure and ensure compliance with this standard?
    • Industry Requirements (if applicable): Any specific industry regulations to be considered.
  2. Standard Name (e.g., Completeness):
    • Explanation: Briefly explain the importance of this standard.
    • Measurement/Compliance Method: How will you measure and ensure compliance with this standard?
    • Industry Requirements (if applicable): Any specific industry regulations to be considered.

[Add more standards as necessary]"

Generate a Training Plan on Data Quality Best Practice
Prompt:

As a data governance expert, I am tasked with creating a training session for my company's employees on data quality best practices. The goal of this training is to educate employees on the importance of data quality, common data quality issues, and best practices for ensuring high-quality data. Please provide an outline for the training session, including key topics and explanations for each section. Make sure to cover the following areas:

  • Introduction to data quality
  • [add common data quality issues and their impact at your company]
  • [best practices for data quality management at your company]
  • Practical tips for maintaining data quality
  • Conclusion and next steps

Data Privacy & Security

Ensure that data privacy and security policies are in place and enforced to protect sensitive information and comply with applicable regulations, such as GDPR or HIPAA. This includes overseeing access controls, encryption, and data masking techniques.

Prompts

Compliance Checklist
Prompt: "Please provide a summary of the [X compliance standard] and create a prioritized checklist to help organizations ensure their adherence to the requirements of this standard. Provide the answer in a table”

List Personal Information from Table Metadata

First use a prompt to extract metadata from the table (for example, the Get Data Set Structure prompt above), then ask:

Prompt: [insert output of the metadata prompt] Can you list all the columns that contain personal information?
Select the best encryption method for a specific dataset
Prompt: "You are an AI expert in data security. I am seeking advice on the best methods to encrypt data. My goal is to ensure the confidentiality and integrity of sensitive information. Please provide a list of recommended encryption methods, along with a brief description of each method and its use cases. Additionally, if there are any Python libraries that can be used to implement these encryption methods, please mention them as well."
Create a data governance assistant
Prompt: Here are our data governance policies: [insert policies]. Can you answer all the following questions based on what is written in these policies?

Data Stewardship

Lead a team of data stewards responsible for managing, maintaining, and documenting the organization's data assets. Ensure that data stewards are trained and have a clear understanding of their roles and responsibilities.

Prompts

Classify & Tag Data Tables
Prompt: Generate business tags for a table named [table name], with the following columns: [column names]. The query used to create the table: [insert query]. For non-sensitive tables, you can also add a data sample: [data sample].
Example
image
image
Organize and group data tables
Prompt: Organize and regroup this list of data tables by theme and business tags: [List Tables]
Example
image
image
Write a memo after a data quality issue
Prompt: [Copy/paste Jira ticket] Can you write a memo to summarize the issue in this ticket? Please structure the answer in the following format:

[Your Name]
[Your Title/Position]
[Your Department]
[Date]

TO: [Recipient Name(s)]
CC: [Optional - Other Relevant Parties to be Copied]
FROM: [Your Name]
SUBJECT: Data Quality Issue and Resolution

Dear [Recipient Name(s)],

I am writing to inform you of a recent data quality issue that was identified within our [data system/database] and to outline the steps taken to address and resolve the matter.

Issue Description: [Provide a brief and clear description of the data quality issue. Include details such as the nature of the problem, the data set(s) affected, and the potential impact on business operations or decision-making.]

Issue Discovery: [Explain how the data quality issue was discovered. If applicable, mention any tools or processes used to identify the issue.]

Resolution Steps: [Outline the steps taken to address and resolve the data quality issue. Include any corrective actions, data validation, or data cleansing processes that were implemented. If the issue has not been fully resolved, explain the ongoing efforts to address it.]

Preventive Measures: [Describe any preventive measures or process improvements that have been put in place to avoid similar data quality issues in the future. This may include changes to data validation rules, data governance policies, or staff training.]

Next Steps: [If applicable, outline any next steps or actions that need to be taken by the recipient(s) or other stakeholders. This may include reviewing updated data, providing feedback, or participating in meetings to discuss the issue further.]

I would like to thank [relevant team members or departments] for their prompt and diligent efforts in addressing this issue. Ensuring the accuracy and integrity of our data is a top priority, and we are committed to continuously improving our data management practices.

Please do not hesitate to reach out to me if you have any questions or require further information regarding this matter.

Thank you for your attention to this issue.

Sincerely,

Stakeholder Communication

Collaborate with various stakeholders, such as data engineers, data scientists, analysts, and business leaders, to understand their data needs and ensure that data governance initiatives support their requirements. Communicate data governance policies, updates, and best practices throughout the organization to drive awareness and adoption.

Write a Jira Ticket
Prompt:

"Please help me create a Jira ticket with the following details:

Title: [Short, descriptive summary of the issue or feature request]

Description:

  • Background: [Provide context or background information about the issue or feature request]
  • Issue/Feature: [Explain the problem or desired functionality in detail]
  • Expected behavior: [Describe what the expected outcome should be]
  • Steps to reproduce: [If applicable, list the steps required to reproduce the issue]
  • Acceptance criteria: [Clearly define the criteria that must be met for the ticket to be considered complete]
  • Additional notes: [Include any other relevant information, such as screenshots, logs, or potential solutions]"
Convert Code into a Language You Understand
Prompt: Convert this code [insert code] into SQL. Also walk me through what the code is doing.
Explain Technical Data Concepts

"Please help me explain the technical data concept of [Technical data concept] to a non-technical business user, focusing on the [Add context about industry] industry.

Provide a clear and concise explanation of the concept, tailored to someone without a technical background, and include a relevant example from the specified industry to help illustrate the concept's application and importance in that context."

Convince leadership to invest in tooling
Prompt: "Compose a persuasive message to leadership advocating for investment in [tool], outlining the reasons for the investment, who will benefit, the estimated cost, and the expected impact on the organization."

Example:
image

🔭
GPT Prompts for Data Analyst
Data Collection and Cleaning

Data analysts gather data from various sources, such as databases, APIs, or spreadsheets, and clean it to ensure accuracy and consistency. Example: A data analyst at a healthcare organization might collect patient data from different hospital departments, clean and standardize it to ensure consistent formatting, and merge it into a single dataset for analysis.

Prompts

Generate Data
Prompt: I want you to act as a fake data generator. I need a dataset that has [x] rows and [y] columns: [insert column names]
Output
image
Generate Data From DDL
Prompt:

"Please help me generate sample data for the following SQL DDL table definition:

SQL DDL: [Provide your SQL DDL table definition, including table name, column names, and data types]

Based on the table definition, please generate a set of somewhat realistic sample data that can be used for testing and mock data generation. Ensure that the sample data is consistent with the meaning of the column names and adheres to the specified data types."

Design Pandas Functions
Prompt:

"Please help me perform a specific operation (x) on the following example DataFrame represented as a table in Markdown format:

[Insert Example DataFrame]

Operation (x): [Describe the desired operation, e.g., filter rows based on a condition, calculate a new column, sort the DataFrame, or group by a specific column]

Please provide the necessary Pandas code to perform the specified operation (x) on this example DataFrame, and show the resulting DataFrame after the operation is applied."

Clean Dataset
Prompt: "Please provide a Python code snippet that demonstrates how to clean and preprocess a dataset, including handling missing values, removing duplicates, and standardizing data formats. Use a sample dataset with columns 'Name', 'Age', 'Gender', and 'Email' for this demonstration."
Example
image
image
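For orientation, the answer to this prompt usually looks something like the sketch below. It assumes pandas is installed. The sample rows and the `clean` helper are invented for this guide, and the choices it makes (dropping rows without a name, imputing age with the median) are illustrative assumptions rather than recommendations.

```python
import pandas as pd

# Hypothetical sample data with the four columns named in the prompt.
raw = pd.DataFrame(
    {
        "Name": [" alice ", "Bob", "Bob", "carol", None],
        "Age": [34, 45, 45, None, 41],
        "Gender": ["F", "male", "male", "Female", "M"],
        "Email": [
            "ALICE@EXAMPLE.COM",
            "bob@example.com",
            "bob@example.com",
            " Carol@Example.com ",
            "dan@example.com",
        ],
    }
)


def clean(df):
    """Return a cleaned copy: standardized text, no duplicates, no missing ages."""
    out = df.copy()
    # Standardize formats: trim whitespace and normalize letter case.
    out["Name"] = out["Name"].str.strip().str.title()
    out["Email"] = out["Email"].str.strip().str.lower()
    # Reduce free-text gender values to their first letter (F / M).
    out["Gender"] = out["Gender"].str.strip().str.upper().str[0]
    # Drop rows that have no name, then drop exact duplicate rows.
    out = out.dropna(subset=["Name"]).drop_duplicates()
    # Fill missing ages with the median of the remaining ages.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Whatever ChatGPT returns, check the imputation and de-duplication rules against your own data before reusing the code.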
Merge Datasets
Prompt: "Please provide a Python code snippet that demonstrates how to merge two datasets using the Pandas library. Assume that the first dataset, 'df1', contains columns 'ID', 'Name', and 'Age', and the second dataset, 'df2', contains columns 'ID', 'City', and 'Country'. Merge the two datasets on the 'ID' column, and show the resulting merged dataset."
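A minimal sketch of the kind of code this prompt tends to produce, assuming pandas is installed. The rows are made up for illustration. The second merge, an outer join with `indicator=True`, is an addition beyond what the prompt asks for; it is a handy way to see which IDs failed to match.

```python
import pandas as pd

# Hypothetical sample data matching the columns described in the prompt.
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ana", "Ben", "Cy"], "Age": [31, 27, 45]})
df2 = pd.DataFrame(
    {"ID": [2, 3, 4], "City": ["Lyon", "Oslo", "Lima"], "Country": ["France", "Norway", "Peru"]}
)

# Inner join: keep only the IDs present in both datasets.
inner = df1.merge(df2, on="ID", how="inner")

# Outer join: keep every ID; the _merge column records where each row came from.
outer = df1.merge(df2, on="ID", how="outer", indicator=True)

print(inner)
print(outer)
```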
Build a simple data scraper
Prompt: "Please provide a Python code snippet that demonstrates how to scrape data from the homepage of 'www.castordoc.com' using the BeautifulSoup and requests libraries. Extract the page title and the text content of the main headings (e.g., h1, h2) on the page, store the data in a pandas DataFrame, and display it. Note: Ensure that your web scraping practices comply with the website's terms of service."

Example
image
Collect Data from an API
Prompt: "Please provide a Python code snippet that demonstrates how to collect data from a public REST API endpoint using the 'requests' library. As an example, use the following API endpoint that returns JSON data about users: 'https://jsonplaceholder.typicode.com/users'. Retrieve the data, parse the JSON response, and display the result in a readable format."
Example:
image

Data Exploration and Analysis

Data analysts explore datasets to understand their structure, identify patterns, trends, and relationships, and perform statistical analyses to test hypotheses. Example: A data analyst at an e-commerce company might analyze customer purchase data to identify seasonal trends, high-performing products, and customer segments with different spending behaviors.

Prompts

Explore Data
Prompt: I want you to act as a data engineer and code for me. I have a dataset of [describe dataset]. Please write code for data visualisation and exploration.
Calculate Running Average
Prompt:

"As a data scientist, I have a table with two columns: [Insert column names]. I'd like to calculate a running average for [specify the desired value or column]. Can you provide the SQL code to accomplish this in PostgreSQL 14?"

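To give a sense of what a running-average answer looks like, here is a small sketch that runs a window-function query from Python. It uses the standard library's SQLite driver (SQLite 3.25 or later) purely as a stand-in so the example is self-contained; the `daily_sales` table and its rows are hypothetical. The `AVG(...) OVER (...)` clause is standard SQL and should also work in PostgreSQL, but verify it against your own database.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-02", 20.0), ("2024-01-03", 30.0), ("2024-01-04", 40.0)],
)

# Running (cumulative) average: average of all rows up to and including the current one.
RUNNING_AVG_SQL = """
SELECT sale_date,
       amount,
       AVG(amount) OVER (
           ORDER BY sale_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_avg
FROM daily_sales
ORDER BY sale_date
"""

rows = conn.execute(RUNNING_AVG_SQL).fetchall()
for sale_date, amount, running_avg in rows:
    print(sale_date, amount, running_avg)
```

For a moving average over a fixed window instead, the frame would typically become something like `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`.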
Modify Existing Queries Slightly
Prompt:

"Please help me modify the following SQL query to achieve a slightly different result:

[Insert Original SQL Query]

Original Query Purpose: [Describe the purpose or goal of the original SQL query]

Desired Modification: [Explain the specific modification you want to make to the query, such as changing the filtering criteria, adding or removing columns, modifying the aggregation, or altering the sorting order]

Please provide the modified SQL query that achieves the desired result, along with an explanation of the changes made and how the new query differs from the original one."

Translate SQL Dialects
Prompt: What is the equivalent of the FUNC1 function in BigQuery?

Example: What is the equivalent of the MIN_BY() function in BigQuery?

Compare Two Similar SQL Queries
Prompt: "Please help me compare the following two similar SQL queries and explain the differences between them: [SQL QUERY 1] [SQL QUERY 2] Analyze both SQL queries and provide a detailed comparison that highlights the differences in terms of structure, syntax, filtering criteria, columns selected, aggregation, and any other relevant aspects. Additionally, explain how these differences may impact the results returned by each query and any potential implications for performance or data accuracy."
Power BI Modeling
Prompt: "As a Power BI expert, please analyze the details of my current project [insert project details here], focusing on the table structure and relationships. Are there any issues or areas for improvement you can identify within the table structure?"

Chain Prompting

1. Generate SQL Query
Prompt:

“[insert schema & data sample] As a senior data analyst, given the above schemas and data, write a detailed and correct [insert DBMS] SQL query to answer the following analytical question:

[question]

Comment the query with your logic.”

2. Double Check SQL Query
Prompt:

“Double check the Postgres query above for common mistakes, including:

- Remembering to add `NULLS LAST` to an ORDER BY DESC clause

- Handling case sensitivity, e.g. using ILIKE instead of LIKE

- Ensuring the join columns are correct

- Casting values to the appropriate type

Rewrite the query here if there are any mistakes. If it looks good as it is, just reproduce the original query.”

3. Debug Query Against DB
Prompt:

[insert query from previous prompt]

The query above produced the following error:

[insert query error]

Rewrite the query with the error fixed:
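The third step of the chain is easy to script. The sketch below runs a query, captures the database error, and fills in the debug prompt shown above. It is only an illustration: it uses SQLite from the standard library rather than your real warehouse, and the `build_debug_prompt` helper and `orders` table are hypothetical names.

```python
import sqlite3


def build_debug_prompt(conn, query):
    """Run `query`; return None if it succeeds, else the filled-in debug prompt."""
    try:
        conn.execute(query).fetchall()
    except sqlite3.Error as exc:
        return (
            f"{query}\n\n"
            "The query above produced the following error:\n\n"
            f"{exc}\n\n"
            "Rewrite the query with the error fixed:"
        )
    return None


# Hypothetical table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

# The misspelled column name triggers an error, so a prompt is produced.
prompt = build_debug_prompt(conn, "SELECT totl FROM orders")
print(prompt)
```

The returned text can be pasted back into ChatGPT as is. When the query runs cleanly the helper returns `None` and the chain stops.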

Reporting and Visualization

Data analysts create reports and visualizations to present their findings in a clear and concise manner to stakeholders, often using tools like Tableau or Power BI. Example: A data analyst working for a marketing agency might create a dashboard displaying the performance metrics of an advertising campaign, such as impressions, click-through rates, and conversions, to help clients understand the campaign's effectiveness.

Prompts

Write PySpark Struct

"Please help me create PySpark StructType and StructField schema definitions for the following dataset:

Dataset columns:

  1. Column Name: [Name of the first column] Data Type: [Data type of the first column, e.g., StringType, IntegerType, DoubleType, etc.] Nullable: [True/False, indicating if the first column can contain null values]
  2. Column Name: [Name of the second column] Data Type: [Data type of the second column] Nullable: [True/False, indicating if the second column can contain null values]

[Continue with further columns as needed]

Please provide the PySpark code for creating the StructType and StructField objects that define the schema for this dataset."

Choose Visualisation Method
Prompt: "You are an expert in data visualization. I need your help to choose the best visualization method for the following problem:

[PROBLEM]

Based on the problem described above, please recommend the most appropriate visualization method to communicate the information effectively, and explain why you think this method is the best choice."

Example
image
Visualise Data
Prompt: "Write Python code to visualize [metric] using [chosen visualization method]."
Example
image
Explore Data
Prompt: "[Insert data sample] Can you do visualizations & descriptive analyses to help me understand the data?"
Perform Linear Regression
Prompt: "[Insert data sample] Can you try regressions and look for patterns? Can you run regression diagnostics?"

Business Insights and Recommendations

Data analysts interpret their findings and provide data-driven insights to support decision-making and improve business processes. Example: A data analyst at a manufacturing company might analyze production data to identify bottlenecks in the assembly line, and recommend process improvements to increase efficiency and reduce costs.

Prompts

Write OKRs
Prompt: Write OKRs for my [N]-person data team. The focus for this quarter is [X], [Y], and [Z].
Example:
image

Maintain data documentation

Data analysts are responsible for maintaining documentation of data sources, data dictionaries, and data processing steps to ensure transparency, reproducibility, and easy access to information for other team members. Example: A data analyst working on a financial reporting project might create and maintain a data dictionary outlining the meaning and format of each column in the dataset, as well as document the data processing and transformation steps taken during the analysis.

Prompts

Write documentation for functions
Prompt: I want you to act as a software developer. Please provide documentation for func1 below. [Insert function]
Extract structure out of data sample
Prompt:

"Please help me extract the structure of the following data sample:

Data Sample: [Provide a sample of your data, either as a small dataset, a JSON snippet, or a few rows of a CSV file]

Based on this sample, please provide the inferred structure, including column names, data types, and any relationships or hierarchies that can be observed in the data. Additionally, provide any suggestions or best practices for storing and processing this data using appropriate tools and technologies."
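If you want to sanity-check the structure ChatGPT infers, a rough type inventory is easy to compute locally. The sketch below covers only top-level keys of JSON records; the `sample` payload and the `infer_structure` helper are hypothetical, and nested objects are reported simply as `list` or `dict`.

```python
import json

# Hypothetical JSON data sample (two records, one with a missing name).
sample = """
[
  {"id": 1, "name": "Ana", "signup": "2024-01-05", "orders": [{"sku": "A1", "qty": 2}]},
  {"id": 2, "name": null, "signup": "2024-02-11", "orders": []}
]
"""


def infer_structure(records):
    """Map each top-level key to the sorted list of Python type names observed."""
    seen = {}
    for record in records:
        for key, value in record.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


structure = infer_structure(json.loads(sample))
for column, types in structure.items():
    print(column, types)
```

A key that maps to more than one type (here `name`, which is sometimes null) is a good candidate to raise in a follow-up question.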

👷🏻
GPT Prompts for Analytics Engineer

ChatGPT can help analytics engineers develop effective data models and transformation logic by providing guidance on best practices, techniques, and tools for data modeling and transformation tasks. Example: An analytics engineer working on customer segmentation might consult ChatGPT for suggestions on feature engineering techniques to enhance the quality of input data for clustering algorithms.

Prompts

Write Jinja Macro
Prompt:

"Please help me create a Jinja macro for my dbt project:

Macro Purpose: [Describe the purpose of the macro, e.g., calculate the age of users, create a timestamp, or format a currency value]

Input Parameters: [List the input parameters required for the macro, including their names and data types]

Expected Output: [Describe the expected output of the macro, including its data type and any specific formatting requirements]

Please provide the Jinja macro code that meets the requirements and can be used in my dbt project, along with an example of how to use the macro in a dbt model SQL file."

Add Runtime Session
Prompt:

"Please help me add a runtime session setting to a model in my dbt project:

Model Name: [Provide the name of the model you want to apply the runtime session setting to]

Session Setting: [Specify the session setting you want to apply, e.g., setting a specific database schema, changing the statement timeout, or adjusting the query priority]

Please provide step-by-step instructions on how to apply the desired runtime session setting to the specified model in my dbt project, including any required code snippets and best practices for implementing session settings in dbt."

dbt model config
Prompt: "Write a dbt model configuration for [use case], including necessary configuration settings such as materialization, schema tests, and any other relevant configurations to optimize the model for the given use case. Make sure to include placeholders where customization is needed."
image
Convert SQL into a dbt model
Prompt: Convert this SQL code: [insert code] into a dbt model. Make sure you include necessary configuration settings such as materialization, schema tests, and any other relevant configurations to optimize the model for the given use case.
Syntax & function guidance
Prompt: "Provide detailed explanations and examples of common dbt syntax and functions, focusing on their usage in analytics engineering projects. Include explanations of key concepts such as ref(), source(), materializations, incremental models, and schema tests. Make sure to cover both basic and advanced functions, as well as any relevant tips and best practices for their effective application."

dbt Models / Query Optimization

ChatGPT can assist analytics engineers in optimizing SQL queries for better performance, by providing tips and best practices for writing efficient queries, indexing strategies, and partitioning techniques. Example: An analytics engineer struggling with slow query performance might ask ChatGPT for recommendations on how to optimize a specific SQL query to reduce its execution time.

Prompts

Data Modeling Questions

Ask ChatGPT general questions about data modeling. The key advantage over Google or Stack Overflow is that you can ask follow-up questions and request examples.

Prompt: "How can I design a data model for [YOUR USE CASE] that takes into account [DATA POINTS]? Please provide insights on entities, attributes, and relationships."
Example:

"How can I design a data model for a movie streaming platform that takes into account movies, actors, directors, genres, user profiles, and viewing history? Please provide insights on entities, attributes, and relationships."

image
image
List Best Practices
Prompt: "Share dbt best practices for analytics engineers, including but not limited to using incremental materializations, adopting proper naming conventions, and organizing projects with packages. Provide explanations, examples, and tips to ensure that analytics engineers are following industry standards and optimizing their work in dbt."
Use dbt snapshots for data versioning
Prompt: "Explain the process of using dbt snapshots for data versioning, including the benefits, key concepts, and configuration options. Provide a step-by-step guide on how to create, configure, and manage snapshots in a dbt project, along with best practices for using snapshots effectively. Make sure to include placeholders for customization depending on the specific use case or dataset."
Integrate Airflow in dbt workflow
Prompt: "Explain the process of integrating dbt workflows with orchestration tools like Apache Airflow, including the benefits, key concepts, and best practices. Provide a step-by-step guide on how to set up and configure the integration between dbt and Apache Airflow, including creating DAGs, tasks, and any necessary scripts or configurations. Make sure to include placeholders for customization depending on the specific project requirements and use case."

Data Validation and Quality Assurance

ChatGPT can help analytics engineers implement robust data validation and quality checks by providing examples of data validation techniques, data quality metrics, and monitoring tools. Example: An analytics engineer building a data pipeline for sales data might use ChatGPT to obtain recommendations on automated data validation processes to ensure data accuracy and consistency.

Data Validation Process
Prompt: "Propose a comprehensive data validation process for a dbt pipeline, including key steps, methodologies, and best practices. Cover aspects such as schema tests, custom data tests, using dbt assertions, and any relevant third-party tools or packages. Provide a step-by-step guide on how to implement and maintain an effective data validation process, making sure to include placeholders for customization depending on the specific project requirements and use case."
Data Quality Test Procedure
Prompt: What is the process to add data quality tests to a dbt model?
Data Quality Test Code
Prompt: "Write data quality tests for the following dbt model: [insert code]"

dbt Docs Generator

ChatGPT can support analytics engineers in creating clear and concise documentation of data models, transformation logic, and data pipeline processes to facilitate collaboration and knowledge sharing among team members. Example: An analytics engineer might use ChatGPT to generate a clear and concise explanation of a complex data transformation process, making it easier for other team members to understand and maintain the pipeline.

Prompts

Add description to dbt model
Prompt: Explain this dbt model: [Insert Model]. Structure the answer in the following format:
  • One-line title about the model
  • Step-by-step explanation of how the model works

Comment dbt code
Prompt: [Insert dbt Code] Act as an analytics engineer & add inline comments to explain the most important parts of the code. Be concise.
Document dbt schemas
Prompt: [Insert dbt schema] Act as an analytics engineer & describe the schema above.
Identify Gap in Documentation
Prompt: [Insert dbt code] Identify the gaps in the documentation of this dbt code. Make suggestions to improve it.
Batch Document Columns - High Quality
Prompt: [insert data sample] [insert dbt model] Please document this data table based on the column values & dbt model.
Explain dbt model
Prompt: Explain this dbt model in simple terms that a business user can understand: [insert dbt model]

Exploratory Data Analysis

ChatGPT can assist analytics engineers in conducting exploratory data analysis by suggesting statistical techniques, visualizations, and tools to identify patterns, trends, and relationships in data. Example: An analytics engineer analyzing user engagement data from a mobile app might consult ChatGPT for ideas on which visualizations and statistical tests would be most effective in uncovering insights about user behavior.

Prompts

Suggest Statistical technique
Prompt: I want to do [X] with the following data: [insert data]. Can you suggest statistical techniques that will help me do [X]? Provide a SQL code sample if possible.
Missing Data Ideas
Prompt: If I am missing [X] data, what is the best way to measure [X]?
Find Best Visualisation Ideas
Prompt: I want to do [X] with the following data: [insert data]. Can you suggest the best data visualisation idea for my use case?

📐 GPT Prompts for Data Architect

Coming Soon