![]() To quickly spot these cases, we could try the following SQL scripts: Either ID columns have some abnormal records, or some null values represent themselves as blank, “Null”, “None”, “N/A”. Much often we would encounter some columns containing not-really-clean data. Relationship between Customer-Address tables via address_id is many-to-one 5 Dirty data Wonder why it is so? That’s your homework for you after reading this blog □ Relationship between Customer-Address tables via address_id is many-to-one. count > table_pk > 0: one record in other table could be mapped with only one record in this table.count = table_pk > 0: one record in other table could be mapped with one or more record in this table.Lastly, we interpret the result using the below logic:.Select C_CUSTOMER_SK as id, C_CURRENT_ADDR_SK as fkįrom SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMERįrom SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER_ADDRESS Select count(distinct customer.id) as customer Then, we use SQL to determine the cardinality:.First of all, determine the PK (primary key) of each table.If you don’t see the cardinality auto-filled in Datagrip, you could manually determine the relationship between 2 tables by following these steps: one C_CURRENT_ADDR_SK in Customer table can be mapped with only one CA_ADDRESS_SK in Address table.one CA_ADDRESS_SK in Address table can be mapped with one or more C_CURRENT_ADDR_SK in Customer table.In the above example, symbol 1.n ↔︎ 1: is understood as: If your dataset have columns constrains well-defined, Datagrip can auto-detect the relationship between the 2 columns you specified in the SQL scripts. However, which kind of relationship is this? Is one X record in table A mapped with one or many record(s) in table B and vice versa? There’s a fancy term for this: cardinality. Now we know table A is connected to table B by column X. So don’t waste your time trying to open & edit it by UML editing software like: StarUML, Lucidchart, Drawio,… 4 Relationship types Note: This UML format is an internal format supported only by DataGrip. Tips: If you want to explore data from this diagram, simply click the table, then press F4 to see detailed data.Īfter finishing with our relational diagram, you have several options to save it: For example, I found that there’s a column contains both null and 0 values, indicating it needs a cleanup at later stage. You can add custom notes to the diagram later when you found some interesting about our data guy. Note: You can click the comment button to make the column comments visible in our diagram. Right-click a database object and select Diagrams | Show Visualization (Ctrl+Alt+Shift+U), and magical things happen, tarraaa! In other words, you need to know which is the primary key, which is the foreign key, and how tables are connected via these keys.ĭataGrip has a great function for us to look at the overall relationships among all the tables. Next, you want to know how each of the tables in the dataset is connected to each other. Pretty simple to this point, right? EDA and DataGrip is not such a hard thing huh □ 3 Relational Diagram I personally often use this instead of typing SQL syntax: describe table X It’s very handy when you want to quickly check the datatype of all the columns, or copy the list of columns’ names. Simply double-click a table in the Database Explorer in Data Grip, the table opens in the data editor.īonus tip: “Ctrl+B” is the shortcut that gives you the DDL of the table. Now you want to know what the actual data in each table looks like. Now all the dishes are on the table, let’s enjoy our data meals. One more click of the chosen table to see the list of its columns names In DataGrip, you just need to click the small box next to the data source name, choose the schemas that you want to explore, then click refresh. 1 Dataset overviewįirst of all, we need an overview of the dataset that we’re about to work with: a list of tables and views, as well as their columns names. They’re all basic SQL syntax, so if you’re using a different version of the SQL language (MySQL, SQL server, BigQuery, etc.), there would be little or no syntax conversion needed. ![]() Note: In this article, I use Snowflake public dataset and Snowflake SQL commands for all SQL queries. So why don’t follow me step by step on how to perform EDA easily with the great help of Datagrip and SQL. ![]() It’s like saying ‘hi’ to your fellow lovely dataset so that we could gain confidence in every extracted information we get from the data later on. This is the first and foremost step to do at the beginning of any project, before we jump into more sophisticated work like refactoring or modeling. Exploratory Data Analysis (EDA) is something that we do pretty frequently.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |