How to Read Messy Tab-Delimited .DAT Files with Grouped Lines in R
Author: vlogommentary
Uploaded: 2026-01-23
Views: 1
Description:
Learn a clean and efficient method to read irregular tab-delimited .DAT files in R by grouping related lines and parsing them into structured data.
---
This video is based on the question https://stackoverflow.com/q/79376510/ asked by the user 'afleishman' ( https://stackoverflow.com/u/4424306/ ) and on the answer https://stackoverflow.com/a/79376876/ provided by the user 'margusl' ( https://stackoverflow.com/u/646761/ ) on the Stack Overflow website. Thanks to these great users and the Stack Exchange community for their contributions.
Visit these links for original content and any more details, such as alternate solutions, latest updates/developments on topic, comments, revision history etc. For example, the original title of the Question was: Reading .DAT file with odd tab-delimited structure in r
Also, content (except music) is licensed under CC BY-SA https://meta.stackexchange.com/help/l...
The original Question post is licensed under the 'CC BY-SA 4.0' ( https://creativecommons.org/licenses/... ) license, and the original Answer post is licensed under the 'CC BY-SA 4.0' ( https://creativecommons.org/licenses/... ) license.
If anything seems off to you, please feel free to drop me a comment under this video.
---
Introduction
When working with .DAT files that are supposed to be tab-delimited but include irregular lines (such as free text without tabs), standard functions like read_tsv() may fail or produce incorrect output. This often happens when data rows span multiple lines or contain notes embedded beneath main records.
The Challenge
You have a .DAT file where:
Each record should have five columns:
Numeric ID
Date (MM/DD/YYYY)
Time (HH:MM or HH:MM:SS)
Free text field
Free text field
However, the file also contains lines without tabs that belong to the previous record's last column.
For example:
[[See Video to Reveal this Text or Code Snippet]]
Here, the lines without tabs ("UNKNOWN", "CONTRAINDICATION, STOP") are continuation lines for the first record's last column.
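To make the layout concrete, here is a small sketch that builds a hypothetical sample file and counts the tab-separated fields on each line. Only the two continuation strings come from the description above; the IDs, dates, times, and other text values are invented for illustration.

```r
# Hypothetical sample: only "UNKNOWN" and "CONTRAINDICATION, STOP" come from
# the description; every other value is made up for illustration.
sample_lines <- c(
  "101\t01/15/2024\t08:30\tASPIRIN\tALLERGY",
  "UNKNOWN",
  "CONTRAINDICATION, STOP",
  "102\t01/16/2024\t14:05:00\tIBUPROFEN\tNONE"
)

# Write the sample to a temporary file and read it back as raw lines
dat_path <- tempfile(fileext = ".DAT")
writeLines(sample_lines, dat_path)
raw_lines <- readLines(dat_path)

# Count tab-separated fields per line:
# full records have 5 fields, continuation lines have 1
field_counts <- lengths(strsplit(raw_lines, "\t", fixed = TRUE))
print(field_counts)
```

Lines with a field count of 1 are the ones a plain tab-delimited reader cannot place into a row on its own.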
The Solution: Group and Collapse Related Lines
We can solve this by:
Reading all lines as strings using readLines() or readr::read_lines().
Identifying record starts: Lines containing tabs indicate a new record start.
Grouping lines: Take the cumulative sum of the "has a tab" flags, so that every line receives the index of the record it belongs to.
Collapsing lines in each group: Concatenate all lines belonging to the same record, separating continuation lines with ", ".
Parsing the cleaned data: Apply readr::read_tsv() on the collapsed strings.
Concise R Code Implementation
[[See Video to Reveal this Text or Code Snippet]]
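Since the snippet itself is not reproduced in this description, the following is a minimal sketch of the five steps listed above, not necessarily the exact code from the original answer. It assumes the dplyr and readr packages are installed, and it uses a hypothetical sample file (all values other than the two continuation strings are invented).

```r
library(dplyr)
library(readr)

# Hypothetical sample standing in for the real .DAT file
sample_lines <- c(
  "101\t01/15/2024\t08:30\tASPIRIN\tALLERGY",
  "UNKNOWN",
  "CONTRAINDICATION, STOP",
  "102\t01/16/2024\t14:05:00\tIBUPROFEN\tNONE"
)
dat_path <- tempfile(fileext = ".DAT")
write_lines(sample_lines, dat_path)

# Steps 1-4: read raw lines, flag record starts, group, and collapse
collapsed <- tibble(line = read_lines(dat_path)) |>
  group_by(record = cumsum(grepl("\t", line, fixed = TRUE))) |>
  summarise(line = paste(line, collapse = ", ")) |>
  pull(line)

# Step 5: parse the cleaned text; I() marks the string as literal data
records <- read_tsv(
  I(paste(collapsed, collapse = "\n")),
  col_names = FALSE,
  show_col_types = FALSE
)
print(records)
```

Because the date column is in MM/DD/YYYY form, readr is likely to leave it as character; passing an explicit `col_types` specification would be one way to convert it during parsing.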
Explanation
grepl("\t", line) returns a logical vector identifying lines with tabs (record starts).
cumsum() turns this into a grouping integer that increments only when a new record starts.
summarise(paste(...)) joins all lines of a record into one string with comma-separated continuation texts.
Finally, read_tsv() easily parses the well-structured tab-delimited data.
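The grouping trick is easier to follow when the intermediate values are visible. The sketch below uses base R only, with `tapply()` standing in for `summarise()` (a substitution made here for illustration), and the same hypothetical sample lines.

```r
# Hypothetical sample lines (values invented for illustration)
lines <- c(
  "101\t01/15/2024\t08:30\tASPIRIN\tALLERGY",
  "UNKNOWN",
  "CONTRAINDICATION, STOP",
  "102\t01/16/2024\t14:05:00\tIBUPROFEN\tNONE"
)

# TRUE for lines that contain a tab, i.e. lines that start a record
is_start <- grepl("\t", lines, fixed = TRUE)
print(is_start)   # TRUE FALSE FALSE TRUE

# The running total increases only at record starts, so continuation
# lines inherit the id of the record above them
record_id <- cumsum(is_start)
print(record_id)  # 1 1 1 2

# Join the lines of each record into a single string
collapsed <- as.vector(tapply(lines, record_id, paste, collapse = ", "))
print(collapsed)
```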
Result
The output data frame will have five columns (X1 to X5 are the default names readr assigns when col_names = FALSE):
X1: Numeric identifier
X2: Date
X3: Time
X4: Free text
X5: Free text, with any continuation lines appended and separated by ", "
This method works as long as continuation lines never contain tabs themselves; a continuation line with a tab would be mistaken for the start of a new record.
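If that assumption cannot be guaranteed, one possible variation (not part of the original answer) is to detect record starts with a stricter pattern built from the known column layout, and to neutralise stray tabs in continuation lines. The sketch below uses base R and a hypothetical sample in which a continuation line does contain a tab; the pattern and the sample values are assumptions made for illustration.

```r
# Hypothetical sample: the second line is a continuation that contains a tab
lines <- c(
  "101\t01/15/2024\t08:30\tASPIRIN\tALLERGY",
  "SEE NOTE\tPAGE 2",
  "102\t01/16/2024\t14:05:00\tIBUPROFEN\tNONE"
)

# Stricter record-start test: numeric ID, tab, MM/DD/YYYY date, tab
start_pattern <- "^[0-9]+\t[0-9]{2}/[0-9]{2}/[0-9]{4}\t"
is_start <- grepl(start_pattern, lines)

# Replace stray tabs in continuation lines so they cannot add extra columns
lines[!is_start] <- gsub("\t", " ", lines[!is_start], fixed = TRUE)

# Group and collapse as before
collapsed <- as.vector(tapply(lines, cumsum(is_start), paste, collapse = ", "))

# Sanity check: every collapsed record should contain exactly 4 tabs (5 fields)
tab_counts <- lengths(regmatches(collapsed, gregexpr("\t", collapsed, fixed = TRUE)))
stopifnot(all(tab_counts == 4))
print(collapsed)
```

The final `stopifnot()` doubles as a cheap validation step: it fails early if any record would not split into exactly five fields.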
Summary
Handling irregular tab-delimited files with continuation lines can be tricky, but simple grouping based on tab presence combined with collapsing lines enables clean parsing into tidy data frames.
Keep this pattern handy when your data doesn't fit neatly into standard delimited formats!