// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) Sistina Software, Inc.  1997-2003 All rights reserved.
 * Copyright (C) 2004-2008 Red Hat, Inc.  All rights reserved.
 */

#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/buffer_head.h>
#include <linux/gfs2_ondisk.h>
#include <linux/bio.h>
#include <linux/posix_acl.h>
#include <linux/security.h>

#include "gfs2.h"
#include "incore.h"
#include "bmap.h"
#include "glock.h"
#include "glops.h"
#include "inode.h"
#include "log.h"
#include "meta_io.h"
#include "recovery.h"
#include "rgrp.h"
#include "util.h"
#include "trans.h"
#include "dir.h"
#include "lops.h"

struct workqueue_struct *gfs2_freeze_wq;

extern struct workqueue_struct *gfs2_control_wq;

static void gfs2_ail_error(struct gfs2_glock *gl, const struct buffer_head *bh)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;

	fs_err(sdp,
	       "AIL buffer %p: blocknr %llu state 0x%08lx mapping %p page "
	       "state 0x%lx\n",
	       bh, (unsigned long long)bh->b_blocknr, bh->b_state,
	       bh->b_page->mapping, bh->b_page->flags);
	fs_err(sdp, "AIL glock %u:%llu mapping %p\n",
	       gl->gl_name.ln_type, gl->gl_name.ln_number,
	       gfs2_glock2aspace(gl));
	gfs2_lm(sdp, "AIL error\n");
	gfs2_withdraw_delayed(sdp);
}

/**
 * __gfs2_ail_flush - remove all buffers for a given lock from the AIL
 * @gl: the glock
 * @fsync: set when called from fsync (not all buffers will be clean)
 * @nr_revokes: Number of buffers to revoke
 *
 * None of the buffers should be dirty, locked, or pinned.
 */

static void __gfs2_ail_flush(struct gfs2_glock *gl, bool fsync,
			     unsigned int nr_revokes)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct list_head *head = &gl->gl_ail_list;
	struct gfs2_bufdata *bd, *tmp;
	struct buffer_head *bh;
	const unsigned long b_state = (1UL << BH_Dirty) |
				      (1UL << BH_Pinned) |
				      (1UL << BH_Lock);

	gfs2_log_lock(sdp);
	spin_lock(&sdp->sd_ail_lock);
	list_for_each_entry_safe_reverse(bd, tmp, head, bd_ail_gl_list) {
		if (nr_revokes == 0)
			break;
		bh = bd->bd_bh;
		if (bh->b_state & b_state) {
			if (fsync)
				continue;
			gfs2_ail_error(gl, bh);
		}
		gfs2_trans_add_revoke(sdp, bd);
		nr_revokes--;
	}
	GLOCK_BUG_ON(gl, !fsync && atomic_read(&gl->gl_ail_count));
	spin_unlock(&sdp->sd_ail_lock);
	gfs2_log_unlock(sdp);
}
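
/*
 * Illustrative sketch, not part of GFS2: the loop in __gfs2_ail_flush()
 * tests each buffer's state word against a mask of "busy" bits, then either
 * skips the buffer (fsync case) or reports it and revokes it anyway. The
 * standalone user-space C below models only that decision. Every name
 * prefixed sketch_/SKETCH_ is hypothetical, and the bit numbers are
 * assumptions chosen for illustration; the real BH_* values come from the
 * kernel headers.
 */

```c
#include <stdbool.h>

/* Assumed bit positions, for illustration only. */
enum sketch_bh_bit {
	SKETCH_BH_DIRTY  = 1,
	SKETCH_BH_LOCK   = 2,
	SKETCH_BH_PINNED = 5,
};

#define SKETCH_BUSY_MASK ((1UL << SKETCH_BH_DIRTY) | \
			  (1UL << SKETCH_BH_PINNED) | \
			  (1UL << SKETCH_BH_LOCK))

enum sketch_action {
	SKETCH_SKIP,			/* leave the buffer on the list */
	SKETCH_REVOKE,			/* clean buffer: revoke it */
	SKETCH_REPORT_AND_REVOKE,	/* busy buffer outside fsync: report, then revoke */
};

/* Return true if any of the dirty, pinned, or locked bits is set. */
static inline bool sketch_buffer_busy(unsigned long state)
{
	return (state & SKETCH_BUSY_MASK) != 0;
}

/* Mirror the per-buffer branch structure of the flush loop. */
static inline enum sketch_action sketch_classify(unsigned long state, bool fsync)
{
	if (sketch_buffer_busy(state))
		return fsync ? SKETCH_SKIP : SKETCH_REPORT_AND_REVOKE;
	return SKETCH_REVOKE;
}
```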

static int gfs2_ail_empty_gl(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct gfs2_trans tr;
	unsigned int revokes;
	int ret;

	revokes = atomic_read(&gl->gl_ail_count);

	if (!revokes) {
		bool have_revokes;
		bool log_in_flight;

		/*
		 * We have nothing on the ail, but there could be revokes on
		 * the sdp revoke queue, in which case, we still want to flush
		 * the log and wait for it to finish.
		 *
		 * If the sdp revoke list is empty too, we might still have an
		 * io outstanding for writing revokes, so we should wait for
		 * it before returning.
		 *
		 * If none of these conditions are true, our revokes are all
		 * flushed and we can return.
		 */
		gfs2_log_lock(sdp);
		have_revokes = !list_empty(&sdp->sd_log_revokes);
		log_in_flight = atomic_read(&sdp->sd_log_in_flight);
		gfs2_log_unlock(sdp);
		if (have_revokes)
			goto flush;
		if (log_in_flight)
			log_flush_wait(sdp);
		return 0;
	}

	memset(&tr, 0, sizeof(tr));
	set_bit(TR_ONSTACK, &tr.tr_flags);
	ret = __gfs2_trans_begin(&tr, sdp, 0, revokes, _RET_IP_);
	if (ret)
		goto flush;
	__gfs2_ail_flush(gl, 0, revokes);
	gfs2_trans_end(sdp);

flush:
	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_AIL_EMPTY_GL);
	return 0;
}

void gfs2_ail_flush(struct gfs2_glock *gl, bool fsync)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	unsigned int revokes = atomic_read(&gl->gl_ail_count);
	int ret;

	if (!revokes)
		return;

	ret = gfs2_trans_begin(sdp, 0, revokes);
	if (ret)
		return;
	__gfs2_ail_flush(gl, fsync, revokes);
	gfs2_trans_end(sdp);
	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_AIL_FLUSH);
}

/**
 * gfs2_rgrp_metasync - sync out the metadata of a resource group
 * @gl: the glock protecting the resource group
 *
 */

static int gfs2_rgrp_metasync(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct address_space *metamapping = &sdp->sd_aspace;
	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
	const unsigned bsize = sdp->sd_sb.sb_bsize;
	loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
	loff_t end = PAGE_ALIGN((rgd->rd_addr + rgd->rd_length) * bsize) - 1;
	int error;

	filemap_fdatawrite_range(metamapping, start, end);
	error = filemap_fdatawait_range(metamapping, start, end);
	WARN_ON_ONCE(error && !gfs2_withdrawn(sdp));
	mapping_set_error(metamapping, error);
	if (error)
		gfs2_io_error(sdp);
	return error;
}
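
/*
 * Illustrative sketch, not part of GFS2: gfs2_rgrp_metasync() turns a run of
 * filesystem blocks (rd_addr, rd_length) into an inclusive byte range that
 * covers whole pages, rounding the start down and the end up. The standalone
 * user-space C below reproduces that arithmetic. The 4096-byte page size and
 * all sketch_/SKETCH_ names are assumptions made for illustration; the
 * kernel uses PAGE_MASK and PAGE_ALIGN() for the running architecture.
 */

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Round a byte offset up to the next page boundary. */
static inline uint64_t sketch_page_align(uint64_t off)
{
	return (off + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
}

/* First byte of the page that holds block @addr. */
static inline uint64_t sketch_range_start(uint64_t addr, uint32_t bsize)
{
	return (addr * bsize) & SKETCH_PAGE_MASK;
}

/* Last byte of the page that holds the final block of the run. */
static inline uint64_t sketch_range_end(uint64_t addr, uint64_t length,
					uint32_t bsize)
{
	return sketch_page_align((addr + length) * bsize) - 1;
}
```

/*
 * With 1 KiB blocks, a run that starts at block 17 begins in the middle of a
 * page, so the start is pulled back to the page boundary at byte 16384. The
 * range is inclusive, which is why 1 is subtracted from the aligned end.
 */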

/**
 * rgrp_go_sync - sync out the metadata for this glock
 * @gl: the glock
 *
 * Called when demoting or unlocking an EX glock.  We must flush
 * to disk all dirty buffers/pages relating to this glock, and must not
 * return to caller to demote/unlock the glock until I/O is complete.
 */

static int rgrp_go_sync(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
	int error;

	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
		return 0;
	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);

	gfs2_log_flush(sdp, gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_RGRP_GO_SYNC);
	error = gfs2_rgrp_metasync(gl);
	if (!error)
		error = gfs2_ail_empty_gl(gl);
	gfs2_free_clones(rgd);
	return error;
}

/**
 * rgrp_go_inval - invalidate the metadata for this glock
 * @gl: the glock
 * @flags:
 *
 * We never used LM_ST_DEFERRED with resource groups, so that we
 * should always see the metadata flag set here.
 *
 */

static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct address_space *mapping = &sdp->sd_aspace;
	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
	const unsigned bsize = sdp->sd_sb.sb_bsize;
	loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
	loff_t end = PAGE_ALIGN((rgd->rd_addr + rgd->rd_length) * bsize) - 1;

	gfs2_rgrp_brelse(rgd);
	WARN_ON_ONCE(!(flags & DIO_METADATA));
	truncate_inode_pages_range(mapping, start, end);
}

static void gfs2_rgrp_go_dump(struct seq_file *seq, struct gfs2_glock *gl,
			      const char *fs_id_buf)
{
	struct gfs2_rgrpd *rgd = gl->gl_object;

	if (rgd)
		gfs2_rgrp_dump(seq, rgd, fs_id_buf);
}

static struct gfs2_inode *gfs2_glock2inode(struct gfs2_glock *gl)
{
	struct gfs2_inode *ip;

	spin_lock(&gl->gl_lockref.lock);
	ip = gl->gl_object;
	if (ip)
		set_bit(GIF_GLOP_PENDING, &ip->i_flags);
	spin_unlock(&gl->gl_lockref.lock);
	return ip;
}

struct gfs2_rgrpd *gfs2_glock2rgrp(struct gfs2_glock *gl)
{
	struct gfs2_rgrpd *rgd;

	spin_lock(&gl->gl_lockref.lock);
	rgd = gl->gl_object;
	spin_unlock(&gl->gl_lockref.lock);

	return rgd;
}

static void gfs2_clear_glop_pending(struct gfs2_inode *ip)
{
	if (!ip)
		return;

	clear_bit_unlock(GIF_GLOP_PENDING, &ip->i_flags);
	wake_up_bit(&ip->i_flags, GIF_GLOP_PENDING);
}

/**
 * gfs2_inode_metasync - sync out the metadata of an inode
 * @gl: the glock protecting the inode
 *
 */
int gfs2_inode_metasync(struct gfs2_glock *gl)
{
	struct address_space *metamapping = gfs2_glock2aspace(gl);
	int error;

	filemap_fdatawrite(metamapping);
	error = filemap_fdatawait(metamapping);
	if (error)
		gfs2_io_error(gl->gl_name.ln_sbd);
	return error;
}

/**
 * inode_go_sync - Sync the dirty metadata of an inode
 * @gl: the glock protecting the inode
 *
 */

static int inode_go_sync(struct gfs2_glock *gl)
{
	struct gfs2_inode *ip = gfs2_glock2inode(gl);
	int isreg = ip && S_ISREG(ip->i_inode.i_mode);
	struct address_space *metamapping = gfs2_glock2aspace(gl);
	int error = 0, ret;

	if (isreg) {
		if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags))
			unmap_shared_mapping_range(ip->i_inode.i_mapping, 0, 0);
		inode_dio_wait(&ip->i_inode);
	}
	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
		goto out;

	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);

	gfs2_log_flush(gl->gl_name.ln_sbd, gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_INODE_GO_SYNC);
	filemap_fdatawrite(metamapping);
	if (isreg) {
		struct address_space *mapping = ip->i_inode.i_mapping;

		filemap_fdatawrite(mapping);
		error = filemap_fdatawait(mapping);
		mapping_set_error(mapping, error);
	}
	ret = gfs2_inode_metasync(gl);
	if (!error)
		error = ret;
	gfs2_ail_empty_gl(gl);
	/*
	 * Writeback of the data mapping may cause the dirty flag to be set
	 * so we have to clear it again here.
	 */
	smp_mb__before_atomic();
	clear_bit(GLF_DIRTY, &gl->gl_flags);

out:
	gfs2_clear_glop_pending(ip);
	return error;
}

/**
 * inode_go_inval - prepare an inode glock to be released
 * @gl: the glock
 * @flags:
 *
 * Normally we invalidate everything, but if we are moving into
 * LM_ST_DEFERRED from LM_ST_SHARED or LM_ST_EXCLUSIVE then we
 * can keep hold of the metadata, since it won't have changed.
 *
 */

static void inode_go_inval(struct gfs2_glock *gl, int flags)
{
	struct gfs2_inode *ip = gfs2_glock2inode(gl);

	if (flags & DIO_METADATA) {
		struct address_space *mapping = gfs2_glock2aspace(gl);

		truncate_inode_pages(mapping, 0);
		if (ip) {
			set_bit(GLF_INSTANTIATE_NEEDED, &gl->gl_flags);
			forget_all_cached_acls(&ip->i_inode);
			security_inode_invalidate_secctx(&ip->i_inode);
			gfs2_dir_hash_inval(ip);
		}
	}

	if (ip == GFS2_I(gl->gl_name.ln_sbd->sd_rindex)) {
		gfs2_log_flush(gl->gl_name.ln_sbd, NULL,
			       GFS2_LOG_HEAD_FLUSH_NORMAL |
			       GFS2_LFC_INODE_GO_INVAL);
		gl->gl_name.ln_sbd->sd_rindex_uptodate = 0;
	}
	if (ip && S_ISREG(ip->i_inode.i_mode))
		truncate_inode_pages(ip->i_inode.i_mapping, 0);

	gfs2_clear_glop_pending(ip);
}

/**
 * inode_go_demote_ok - Check to see if it's ok to unlock an inode glock
 * @gl: the glock
 *
 * Returns: 1 if it's ok
 */

static int inode_go_demote_ok(const struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;

	if (sdp->sd_jindex == gl->gl_object || sdp->sd_rindex == gl->gl_object)
		return 0;

	return 1;
}

static int gfs2_dinode_in(struct gfs2_inode *ip, const void *buf)
{
	const struct gfs2_dinode *str = buf;
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
\( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
\( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
\( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-05-09 05:36:02 +03:00
struct timespec64 atime ;
2011-05-09 16:49:59 +04:00
u16 height , depth ;
2021-02-12 21:22:38 +03:00
umode_t mode = be32_to_cpu ( str - > di_mode ) ;
2021-05-19 21:45:56 +03:00
bool is_new = ip - > i_inode . i_state & I_NEW ;
2011-05-09 16:49:59 +04:00
if ( unlikely ( ip - > i_no_addr ! = be64_to_cpu ( str - > di_num . no_addr ) ) )
goto corrupt ;
2021-02-12 21:22:38 +03:00
if ( unlikely ( ! is_new & & inode_wrong_type ( & ip - > i_inode , mode ) ) )
goto corrupt ;
2011-05-09 16:49:59 +04:00
ip - > i_no_formal_ino = be64_to_cpu ( str - > di_num . no_formal_ino ) ;
2021-02-12 21:22:38 +03:00
ip - > i_inode . i_mode = mode ;
if ( is_new ) {
ip - > i_inode . i_rdev = 0 ;
switch ( mode & S_IFMT ) {
case S_IFBLK :
case S_IFCHR :
ip - > i_inode . i_rdev = MKDEV ( be32_to_cpu ( str - > di_major ) ,
be32_to_cpu ( str - > di_minor ) ) ;
break ;
}
2019-10-04 18:55:29 +03:00
}
2011-05-09 16:49:59 +04:00
2013-02-01 10:08:10 +04:00
i_uid_write ( & ip - > i_inode , be32_to_cpu ( str - > di_uid ) ) ;
i_gid_write ( & ip - > i_inode , be32_to_cpu ( str - > di_gid ) ) ;
2017-08-01 19:33:17 +03:00
set_nlink ( & ip - > i_inode , be32_to_cpu ( str - > di_nlink ) ) ;
2011-05-09 16:49:59 +04:00
i_size_write ( & ip - > i_inode , be64_to_cpu ( str - > di_size ) ) ;
gfs2_set_inode_blocks ( & ip - > i_inode , be64_to_cpu ( str - > di_blocks ) ) ;
atime . tv_sec = be64_to_cpu ( str - > di_atime ) ;
atime . tv_nsec = be32_to_cpu ( str - > di_atime_nsec ) ;
2018-05-09 05:36:02 +03:00
if ( timespec64_compare ( & ip - > i_inode . i_atime , & atime ) < 0 )
2011-05-09 16:49:59 +04:00
ip - > i_inode . i_atime = atime ;
ip - > i_inode . i_mtime . tv_sec = be64_to_cpu ( str - > di_mtime ) ;
ip - > i_inode . i_mtime . tv_nsec = be32_to_cpu ( str - > di_mtime_nsec ) ;
ip - > i_inode . i_ctime . tv_sec = be64_to_cpu ( str - > di_ctime ) ;
ip - > i_inode . i_ctime . tv_nsec = be32_to_cpu ( str - > di_ctime_nsec ) ;
ip - > i_goal = be64_to_cpu ( str - > di_goal_meta ) ;
ip - > i_generation = be64_to_cpu ( str - > di_generation ) ;
ip - > i_diskflags = be32_to_cpu ( str - > di_flags ) ;
2011-06-16 17:06:55 +04:00
ip - > i_eattr = be64_to_cpu ( str - > di_eattr ) ;
/* i_diskflags and i_eattr must be set before gfs2_set_inode_flags() */
2011-05-09 16:49:59 +04:00
gfs2_set_inode_flags ( & ip - > i_inode ) ;
height = be16_to_cpu ( str - > di_height ) ;
if ( unlikely ( height > GFS2_MAX_META_HEIGHT ) )
goto corrupt ;
ip - > i_height = ( u8 ) height ;
depth = be16_to_cpu ( str - > di_depth ) ;
if ( unlikely ( depth > GFS2_DIR_MAX_DEPTH ) )
goto corrupt ;
ip - > i_depth = ( u8 ) depth ;
ip - > i_entries = be32_to_cpu ( str - > di_entries ) ;
if ( S_ISREG ( ip - > i_inode . i_mode ) )
gfs2_set_aops ( & ip - > i_inode ) ;
return 0 ;
corrupt :
gfs2_consist_inode ( ip ) ;
return - EIO ;
}
/**
* gfs2_inode_refresh - Refresh the incore copy of the dinode
* @ ip : The GFS2 inode
*
* Returns : errno
*/
int gfs2_inode_refresh ( struct gfs2_inode * ip )
{
struct buffer_head * dibh ;
int error ;
error = gfs2_meta_inode_buffer ( ip , & dibh ) ;
if ( error )
return error ;
error = gfs2_dinode_in ( ip , dibh - > b_data ) ;
brelse ( dibh ) ;
return error ;
}
2006-01-16 19:50:04 +03:00
/**
2021-09-29 23:06:21 +03:00
* inode_go_instantiate - read in an inode if necessary
2021-03-30 19:44:29 +03:00
* @ gl : The glock
2006-01-16 19:50:04 +03:00
*
* Returns : errno
*/
2022-06-10 13:06:06 +03:00
static int inode_go_instantiate ( struct gfs2_glock * gl )
2006-01-16 19:50:04 +03:00
{
2006-02-28 01:23:27 +03:00
struct gfs2_inode * ip = gl - > gl_object ;
2006-01-16 19:50:04 +03:00
gfs2: fix GL_SKIP node_scope problems
Before this patch, when a glock was locked, the very first holder on the
queue would unlock the lockref and call the go_instantiate glops function
(if one existed), unless GL_SKIP was specified. When we introduced the new
node-scope concept, we allowed multiple holders to lock glocks in EX mode
and share the lock.
But node-scope introduced a new problem: if the first holder has GL_SKIP
and the next one does NOT, since it is not the first holder on the queue,
the go_instantiate op was not called. Eventually the GL_SKIP holder may
call the instantiate sub-function (e.g. gfs2_rgrp_bh_get) but there was
still a window of time in which another non-GL_SKIP holder assumes the
instantiate function had been called by the first holder. In the case of
rgrp glocks, this led to a NULL pointer dereference on the buffer_heads.
This patch tries to fix the problem by introducing two new glock flags:
GLF_INSTANTIATE_NEEDED, which keeps track of when the instantiate function
needs to be called to "fill in" or "read in" the object before it is
referenced.
GLF_INSTANTIATE_IN_PROG which is used to determine when a process is
in the process of reading in the object. Whenever a function needs to
reference the object, it checks the GLF_INSTANTIATE_NEEDED flag, and if
set, it sets GLF_INSTANTIATE_IN_PROG and calls the glops "go_instantiate"
function.
As before, the gl_lockref spin_lock is unlocked during the IO operation,
which may take a relatively long amount of time to complete. While
unlocked, if another process determines go_instantiate is still needed,
it sees GLF_INSTANTIATE_IN_PROG is set, and waits for the go_instantiate
glop operation to be completed. Once GLF_INSTANTIATE_IN_PROG is cleared,
it needs to check GLF_INSTANTIATE_NEEDED again because the other process's
go_instantiate operation may not have been successful.
Functions that previously called the instantiate sub-functions now call
directly into gfs2_instantiate so the new bits are managed properly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-06 17:29:18 +03:00
if ( ! ip ) /* no inode to populate - read it in later */
2022-06-10 12:42:33 +03:00
return 0 ;
2006-01-16 19:50:04 +03:00
2022-06-10 12:42:33 +03:00
return gfs2_inode_refresh ( ip ) ;
}
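The GLF_INSTANTIATE_NEEDED / GLF_INSTANTIATE_IN_PROG protocol described in the "fix GL_SKIP node_scope problems" commit message can be sketched in user space. This is only an illustration under stated assumptions, not the gfs2 code: `struct obj`, `obj_init()`, `obj_instantiate()` and `flaky_read()` are hypothetical names, and a pthread mutex and condition variable stand in for the glock's spin lock and bit-wait machinery. The point it shows is the re-check: a waiter must test the "needed" flag again after the in-progress flag clears, because the other caller's read may have failed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical object guarded by two flags, loosely modelled on a glock. */
struct obj {
	pthread_mutex_t lock;
	pthread_cond_t done;
	bool needed;    /* plays the role of GLF_INSTANTIATE_NEEDED */
	bool in_prog;   /* plays the role of GLF_INSTANTIATE_IN_PROG */
	int (*instantiate)(struct obj *o); /* slow "read in" callback */
	int calls;      /* how many times the callback has run */
};

static void obj_init(struct obj *o, int (*fn)(struct obj *))
{
	pthread_mutex_init(&o->lock, NULL);
	pthread_cond_init(&o->done, NULL);
	o->needed = true;
	o->in_prog = false;
	o->instantiate = fn;
	o->calls = 0;
}

/* Run the callback at most once at a time; returns 0 or its error. */
static int obj_instantiate(struct obj *o)
{
	int ret;

	pthread_mutex_lock(&o->lock);
again:
	if (!o->needed) {
		pthread_mutex_unlock(&o->lock);
		return 0;
	}
	if (o->in_prog) {
		/* Someone else is reading the object in: wait for them. */
		while (o->in_prog)
			pthread_cond_wait(&o->done, &o->lock);
		/* Their attempt may have failed, so check "needed" again. */
		goto again;
	}
	o->in_prog = true;
	pthread_mutex_unlock(&o->lock);

	ret = o->instantiate(o);	/* slow I/O runs without the lock */

	pthread_mutex_lock(&o->lock);
	if (ret == 0)
		o->needed = false;
	o->in_prog = false;
	pthread_cond_broadcast(&o->done);
	pthread_mutex_unlock(&o->lock);
	return ret;
}

/* Hypothetical callback: fails the first time, succeeds afterwards. */
static int flaky_read(struct obj *o)
{
	return ++o->calls == 1 ? -5 : 0;	/* -5 mirrors -EIO */
}
```

With `flaky_read()` as the callback, the first call to `obj_instantiate()` fails and leaves `needed` set, the second call retries and clears it, and later calls return immediately without invoking the callback.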
static int inode_go_held ( struct gfs2_holder * gh )
{
struct gfs2_glock * gl = gh - > gh_gl ;
struct gfs2_inode * ip = gl - > gl_object ;
int error = 0 ;
if ( ! ip ) /* no inode to populate - read it in later */
return 0 ;
2006-01-16 19:50:04 +03:00
2013-12-19 15:04:14 +04:00
if ( gh - > gh_state ! = LM_ST_DEFERRED )
inode_dio_wait ( & ip - > i_inode ) ;
2008-11-04 13:05:22 +03:00
if ( ( ip - > i_diskflags & GFS2_DIF_TRUNC_IN_PROG ) & &
2006-01-16 19:50:04 +03:00
( gl - > gl_state = = LM_ST_EXCLUSIVE ) & &
2022-06-02 23:15:02 +03:00
( gh - > gh_state = = LM_ST_EXCLUSIVE ) )
error = gfs2_truncatei_resume ( ip ) ;
2006-01-16 19:50:04 +03:00
return error ;
}
2008-05-21 20:03:22 +04:00
/**
* inode_go_dump - print information about an inode
* @ seq : The iterator
2021-03-30 19:44:29 +03:00
* @ gl : The glock
2019-05-09 17:21:48 +03:00
* @ fs_id_buf : file system id ( may be empty )
2008-05-21 20:03:22 +04:00
*
*/
2019-05-09 17:21:48 +03:00
static void inode_go_dump ( struct seq_file * seq , struct gfs2_glock * gl ,
const char * fs_id_buf )
2008-05-21 20:03:22 +04:00
{
2018-04-18 22:05:01 +03:00
struct gfs2_inode * ip = gl - > gl_object ;
struct inode * inode = & ip - > i_inode ;
unsigned long nrpages ;
2008-05-21 20:03:22 +04:00
if ( ip = = NULL )
2014-01-16 14:31:13 +04:00
return ;
2018-04-18 22:05:01 +03:00
xa_lock_irq ( & inode - > i_data . i_pages ) ;
nrpages = inode - > i_data . nrpages ;
xa_unlock_irq ( & inode - > i_data . i_pages ) ;
2019-05-09 17:21:48 +03:00
gfs2_print_dbg ( seq , " %s I: n:%llu/%llu t:%u f:0x%02lx d:0x%08x s:%llu "
" p:%lu \n " , fs_id_buf ,
2008-05-21 20:03:22 +04:00
( unsigned long long ) ip - > i_no_formal_ino ,
( unsigned long long ) ip - > i_no_addr ,
2008-11-10 13:10:12 +03:00
IF2DT ( ip - > i_inode . i_mode ) , ip - > i_flags ,
( unsigned int ) ip - > i_diskflags ,
2018-04-18 22:05:01 +03:00
( unsigned long long ) i_size_read ( inode ) , nrpages ) ;
2008-05-21 20:03:22 +04:00
}
2006-01-16 19:50:04 +03:00
/**
GFS2: remove transaction glock
GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.
This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In its place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed on
freezing, and actions which need to be safe from freezing, like
recovery.
When a node wants to freeze the filesystem, it grabs this glock
exclusively. When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush. gfs2_log_flush() does all the work for flushing out
and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again. Since the filesystem is stuck in
gfs2_log_flush, no new transaction can start, and nothing can be written
to disk. Unfreezing the filesystem simply involves dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.
However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.
In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem. The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.
The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-05-02 07:26:55 +04:00
* freeze_go_sync - promote / demote the freeze glock
2006-01-16 19:50:04 +03:00
* @ gl : the glock
*/
2019-11-13 23:09:28 +03:00
static int freeze_go_sync ( struct gfs2_glock * gl )
2006-01-16 19:50:04 +03:00
{
2014-11-14 05:42:04 +03:00
int error = 0 ;
2015-03-16 19:52:05 +03:00
struct gfs2_sbd * sdp = gl - > gl_name . ln_sbd ;
2006-01-16 19:50:04 +03:00
2020-11-18 16:54:31 +03:00
/*
* We need to check gl_state = = LM_ST_SHARED here and not gl_req = =
* LM_ST_EXCLUSIVE . That ' s because when any node does a freeze ,
* all the nodes should have the freeze glock in SH mode and they all
* call do_xmote : One for EX and the others for UN . They ALL must
* freeze locally , and they ALL must queue freeze work . The freeze_work
* calls freeze_func , which tries to reacquire the freeze glock in SH ,
* effectively waiting for the thaw on the node who holds it in EX .
* Once thawed , the work func acquires the freeze glock in
* SH and everybody goes back to thawed .
*/
2020-11-24 19:41:40 +03:00
if ( gl - > gl_state = = LM_ST_SHARED & & ! gfs2_withdrawn ( sdp ) & &
! test_bit ( SDF_NORECOVERY , & sdp - > sd_flags ) ) {
2014-11-14 05:42:04 +03:00
atomic_set ( & sdp - > sd_freeze_state , SFS_STARTING_FREEZE ) ;
error = freeze_super ( sdp - > sd_vfs ) ;
if ( error ) {
2019-05-13 17:42:18 +03:00
fs_info ( sdp , " GFS2: couldn't freeze filesystem: %d \n " ,
error ) ;
gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.
This patch makes file system withdraws visible to the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.
The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.
Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.
The "live" glock is now used to signal when a withdraw
occurs. When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only used to indicate a
withdraw has occurred.
Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.
Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.
We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.
Also, before this patch, if a process encountered an error and
decided to withdraw, if another process was already withdrawing,
the second withdraw would be silently ignored, which set it free
to unlock its glocks. That's correct behavior if the original
withdrawer encounters further errors down the road. But if
secondary waiters don't wait for the journal replay, unlocking
glocks will allow other nodes to use them, despite the fact that
the journal containing those blocks is being replayed. The
replay needs to finish before our glocks are released to other
nodes. IOW, secondary withdraws need to wait for the first
withdraw to finish.
For example, if an rgrp glock is unlocked by a process that didn't
wait for the first withdraw, a journal replay could introduce file
system corruption by replaying a rgrp block that has already been
granted to a different cluster node.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-28 22:23:45 +03:00
if ( gfs2_withdrawn ( sdp ) ) {
atomic_set ( & sdp - > sd_freeze_state , SFS_UNFROZEN ) ;
2019-11-13 23:09:28 +03:00
return 0 ;
2020-01-28 22:23:45 +03:00
}
2014-11-14 05:42:04 +03:00
gfs2_assert_withdraw ( sdp , 0 ) ;
}
queue_work ( gfs2_freeze_wq , & sdp - > sd_freeze_work ) ;
2020-06-25 21:29:44 +03:00
if ( test_bit ( SDF_JOURNAL_LIVE , & sdp - > sd_flags ) )
gfs2_log_flush ( sdp , NULL , GFS2_LOG_HEAD_FLUSH_FREEZE |
GFS2_LFC_FREEZE_GO_SYNC ) ;
else /* read-only mounts */
atomic_set ( & sdp - > sd_freeze_state , SFS_FROZEN ) ;
2006-01-16 19:50:04 +03:00
}
2019-11-13 23:09:28 +03:00
return 0 ;
2006-01-16 19:50:04 +03:00
}
/**
2014-05-02 07:26:55 +04:00
 * freeze_go_xmote_bh - After promoting/demoting the freeze glock
2006-01-16 19:50:04 +03:00
 * @gl: the glock
 */
2021-03-19 14:56:44 +03:00
static int freeze_go_xmote_bh(struct gfs2_glock *gl)
2006-01-16 19:50:04 +03:00
{
2015-03-16 19:52:05 +03:00
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
2006-06-14 23:32:57 +04:00
	struct gfs2_inode *ip = GFS2_I(sdp->sd_jdesc->jd_inode);
2006-02-28 01:23:27 +03:00
	struct gfs2_glock *j_gl = ip->i_gl;
2006-10-14 05:47:13 +04:00
	struct gfs2_log_header_host head;
2006-01-16 19:50:04 +03:00
	int error;
2008-05-21 20:03:22 +04:00
	if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags)) {
2006-11-20 18:37:45 +03:00
		j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);
2006-01-16 19:50:04 +03:00
2019-05-02 22:17:40 +03:00
		error = gfs2_find_jhead(sdp->sd_jdesc, &head, false);
2021-06-01 17:41:40 +03:00
		if (gfs2_assert_withdraw_delayed(sdp, !error))
			return error;
		if (gfs2_assert_withdraw_delayed(sdp, head.lh_flags &
						 GFS2_LOG_HEAD_UNMOUNT))
			return -EIO;
		sdp->sd_log_sequence = head.lh_sequence + 1;
		gfs2_log_pointers_init(sdp, head.lh_blkno);
2006-01-16 19:50:04 +03:00
	}
2008-05-21 20:03:22 +04:00
	return 0;
2006-01-16 19:50:04 +03:00
}
2008-11-20 16:39:47 +03:00
/**
2021-03-30 19:44:29 +03:00
 * freeze_go_demote_ok
2008-11-20 16:39:47 +03:00
 * @gl: the glock
 *
 * Always returns 0
 */
2014-05-02 07:26:55 +04:00
static int freeze_go_demote_ok(const struct gfs2_glock *gl)
2008-11-20 16:39:47 +03:00
{
	return 0;
}
2009-07-24 03:52:34 +04:00
/**
 * iopen_go_callback - schedule the dcache entry for the inode to be deleted
 * @gl: the glock
2021-03-30 19:44:29 +03:00
 * @remote: true if this came from a different cluster node
2009-07-24 03:52:34 +04:00
 *
2015-10-29 18:58:09 +03:00
 * gl_lockref.lock lock is held while calling this
2009-07-24 03:52:34 +04:00
 */
2013-04-10 13:26:55 +04:00
static void iopen_go_callback(struct gfs2_glock *gl, bool remote)
2009-07-24 03:52:34 +04:00
{
2017-06-30 15:55:08 +03:00
	struct gfs2_inode *ip = gl->gl_object;
2015-03-16 19:52:05 +03:00
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
2011-03-30 17:17:51 +04:00
2017-07-17 10:45:34 +03:00
	if (!remote || sb_rdonly(sdp->sd_vfs))
2011-03-30 17:17:51 +04:00
		return;
2009-07-24 03:52:34 +04:00
	if (gl->gl_demote_state == LM_ST_UNLOCKED &&
2009-12-08 15:12:13 +03:00
	    gl->gl_state == LM_ST_SHARED && ip) {
2013-10-15 18:18:08 +04:00
		gl->gl_lockref.count++;
2020-01-16 22:12:26 +03:00
		if (!queue_delayed_work(gfs2_delete_workqueue,
					&gl->gl_delete, 0))
2013-10-15 18:18:08 +04:00
			gl->gl_lockref.count--;
2009-07-24 03:52:34 +04:00
	}
}
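The reference handling in iopen_go_callback() follows a common pattern: take a reference on behalf of the work item before queueing it, and give the reference back if the work was already queued. The following is a minimal userspace sketch of that pattern only. The names sk_glock, sk_work, sk_queue_work and sk_schedule_delete are hypothetical stand-ins, not the GFS2 or workqueue API, and no locking is modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a delayed work item. */
struct sk_work {
	bool pending;
};

/* Hypothetical stand-in for a glock with a reference count. */
struct sk_glock {
	int refcount;
	struct sk_work delete_work;
};

/* Returns false if the work was already pending, true if newly queued. */
static bool sk_queue_work(struct sk_work *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Take a reference for the work item; drop it if nothing was queued. */
static void sk_schedule_delete(struct sk_glock *gl)
{
	gl->refcount++;
	if (!sk_queue_work(&gl->delete_work))
		gl->refcount--;
}
```

Taking the reference first means the object cannot be released between queueing and the increment; the decrement on failure keeps the count balanced when the work item already owns a reference.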
2020-01-16 22:12:26 +03:00
static int iopen_go_demote_ok(const struct gfs2_glock *gl)
{
	return !gfs2_delete_work_queued(gl);
}
gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal and tell
dlm it's done with the file system, and none of the other nodes
would know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.
This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.
The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.
Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.
The "live" glock is now used to signal when a withdraw
occurs. When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only used to indicate that a
withdraw has occurred.
Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.
Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.
We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.
Also, before this patch, if a process encountered an error and
decided to withdraw, if another process was already withdrawing,
the second withdraw would be silently ignored, which set it free
to unlock its glocks. That's correct behavior if the original
withdrawer encounters further errors down the road. But if
secondary waiters don't wait for the journal replay, unlocking
glocks will allow other nodes to use them, despite the fact that
the journal containing those blocks is being replayed. The
replay needs to finish before our glocks are released to other
nodes. IOW, secondary withdraws need to wait for the first
withdraw to finish.
For example, if an rgrp glock is unlocked by a process that didn't
wait for the first withdraw, a journal replay could introduce file
system corruption by replaying a rgrp block that has already been
granted to a different cluster node.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-28 22:23:45 +03:00
/**
 * inode_go_free - wake up anyone waiting for dlm's unlock ast to free it
 * @gl: glock being freed
 *
 * For now, this is only used for the journal inode glock. In withdraw
 * situations, we need to wait for the glock to be freed so that we know
 * other nodes may proceed with recovery / journal replay.
 */
static void inode_go_free(struct gfs2_glock *gl)
{
	/* Note that we cannot reference gl_object because it's already set
	 * to NULL by this point in its lifecycle. */
	if (!test_bit(GLF_FREEING, &gl->gl_flags))
		return;
	clear_bit_unlock(GLF_FREEING, &gl->gl_flags);
	wake_up_bit(&gl->gl_flags, GLF_FREEING);
}
/**
 * nondisk_go_callback - used to signal when a node did a withdraw
 * @gl: the nondisk glock
 * @remote: true if this came from a different cluster node
 *
 */
static void nondisk_go_callback(struct gfs2_glock *gl, bool remote)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	/* Ignore the callback unless it's from another node, and it's the
	   live lock. */
	if (!remote || gl->gl_name.ln_number != GFS2_LIVE_LOCK)
		return;
	/* First order of business is to cancel the demote request. We don't
	 * really want to demote a nondisk glock. At best it's just to inform
	 * us of another node's withdraw. We'll keep it in SH mode. */
	clear_bit(GLF_DEMOTE, &gl->gl_flags);
	clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags);
	/* Ignore the unlock if we're withdrawn, unmounting, or in recovery. */
	if (test_bit(SDF_NORECOVERY, &sdp->sd_flags) ||
	    test_bit(SDF_WITHDRAWN, &sdp->sd_flags) ||
	    test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags))
		return;
	/* We only care when a node wants us to unlock, because that means
	 * they want a journal recovered. */
	if (gl->gl_demote_state != LM_ST_UNLOCKED)
		return;
	if (sdp->sd_args.ar_spectator) {
		fs_warn(sdp, "Spectator node cannot recover journals.\n");
		return;
	}
	fs_warn(sdp, "Some node has withdrawn; checking for recovery.\n");
	set_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
	/*
	 * We can't call remote_withdraw directly here or gfs2_recover_journal
	 * because this is called from the glock unlock function and the
	 * remote_withdraw needs to enqueue and dequeue the same "live" glock
	 * we were called from. So we queue it to the control work queue in
	 * lock_dlm.
	 */
	queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
}
2006-08-30 17:30:00 +04:00
const struct gfs2_glock_operations gfs2_meta_glops = {
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_META,
gfs2: Allow some glocks to be used during withdraw
We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
when we're withdrawn. For example, to maintain metadata integrity, we should
disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
iopen or the transaction glocks may be safely used because none of their
metadata goes through the journal. So in general, we should disallow all
glocks with an address space, and allow all the others. One exception is:
we need to allow our active journal to be demoted so others may recover it.
Allowing glocks after withdraw gives us the ability to take appropriate
action (in a following patch) to have our journal properly replayed by
another node rather than just abandoning the current transactions and
pretending nothing bad happened, leaving the other nodes free to modify
the blocks we had in our journal, which may result in file system
corruption.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2019-06-13 21:28:45 +03:00
	.go_flags = GLOF_NONDISK,
2006-01-16 19:50:04 +03:00
};
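The rule stated in the "Allow some glocks to be used during withdraw" message above (disallow glocks with an address space, allow the others, but still let the active journal be demoted) can be sketched as a small predicate. This is an illustration under assumptions: SK_ASPACE, struct sk_glock, is_active_journal and usable_when_withdrawn are hypothetical names for this sketch and are not taken from the GFS2 sources.

```c
#include <stdbool.h>

/* Hypothetical flag: the glock has an address space (cached metadata). */
#define SK_ASPACE 0x1u

struct sk_glock {
	unsigned int flags;
	bool is_active_journal;	/* assumed marker for our own journal glock */
};

/*
 * Decide whether an operation on a glock may proceed after a withdraw.
 * "demoting" is true when the operation is a demote request.
 */
static bool usable_when_withdrawn(const struct sk_glock *gl, bool demoting)
{
	/* Glocks without an address space are allowed. */
	if (!(gl->flags & SK_ASPACE))
		return true;
	/* Exception: the active journal may be demoted so others can recover it. */
	return gl->is_active_journal && demoting;
}
```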
2006-08-30 17:30:00 +04:00
const struct gfs2_glock_operations gfs2_inode_glops = {
2012-10-24 22:41:05 +04:00
	.go_sync = inode_go_sync,
2006-01-16 19:50:04 +03:00
	.go_inval = inode_go_inval,
	.go_demote_ok = inode_go_demote_ok,
2021-09-29 23:06:21 +03:00
	.go_instantiate = inode_go_instantiate,
2022-06-10 12:42:33 +03:00
	.go_held = inode_go_held,
2008-05-21 20:03:22 +04:00
	.go_dump = inode_go_dump,
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_INODE,
2020-01-13 23:21:49 +03:00
	.go_flags = GLOF_ASPACE | GLOF_LRU | GLOF_LVB,
2020-01-28 22:23:45 +03:00
	.go_free = inode_go_free,
2006-01-16 19:50:04 +03:00
};
2006-08-30 17:30:00 +04:00
const struct gfs2_glock_operations gfs2_rgrp_glops = {
2012-10-24 22:41:05 +04:00
	.go_sync = rgrp_go_sync,
2009-03-09 12:03:51 +03:00
	.go_inval = rgrp_go_inval,
2021-09-29 23:06:21 +03:00
	.go_instantiate = gfs2_rgrp_go_instantiate,
2020-10-07 14:30:58 +03:00
	.go_dump = gfs2_rgrp_go_dump,
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_RGRP,
2013-12-06 20:19:54 +04:00
	.go_flags = GLOF_LVB,
2006-01-16 19:50:04 +03:00
};
2014-05-02 07:26:55 +04:00
const struct gfs2_glock_operations gfs2_freeze_glops = {
	.go_sync = freeze_go_sync,
	.go_xmote_bh = freeze_go_xmote_bh,
	.go_demote_ok = freeze_go_demote_ok,
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_NONDISK,
2019-06-13 21:28:45 +03:00
	.go_flags = GLOF_NONDISK,
2006-01-16 19:50:04 +03:00
};
2006-08-30 17:30:00 +04:00
const struct gfs2_glock_operations gfs2_iopen_glops = {
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_IOPEN,
2009-07-24 03:52:34 +04:00
	.go_callback = iopen_go_callback,
2021-12-14 18:40:12 +03:00
	.go_dump = inode_go_dump,
2020-01-16 22:12:26 +03:00
	.go_demote_ok = iopen_go_demote_ok,
2019-06-13 21:28:45 +03:00
	.go_flags = GLOF_LRU | GLOF_NONDISK,
2020-11-23 18:53:35 +03:00
	.go_subclass = 1,
2006-01-16 19:50:04 +03:00
};
2006-08-30 17:30:00 +04:00
const struct gfs2_glock_operations gfs2_flock_glops = {
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_FLOCK,
2019-06-13 21:28:45 +03:00
	.go_flags = GLOF_LRU | GLOF_NONDISK,
2006-01-16 19:50:04 +03:00
};
2006-08-30 17:30:00 +04:00
const struct gfs2_glock_operations gfs2_nondisk_glops = {
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_NONDISK,
2019-06-13 21:28:45 +03:00
	.go_flags = GLOF_NONDISK,
2020-01-28 22:23:45 +03:00
	.go_callback = nondisk_go_callback,
2006-01-16 19:50:04 +03:00
};
2006-08-30 17:30:00 +04:00
const struct gfs2_glock_operations gfs2_quota_glops = {
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_QUOTA,
2019-06-13 21:28:45 +03:00
	.go_flags = GLOF_LVB | GLOF_LRU | GLOF_NONDISK,
2006-01-16 19:50:04 +03:00
};
2006-08-30 17:30:00 +04:00
const struct gfs2_glock_operations gfs2_journal_glops = {
2006-09-05 18:53:09 +04:00
	.go_type = LM_TYPE_JOURNAL,
2019-06-13 21:28:45 +03:00
	.go_flags = GLOF_NONDISK,
2006-01-16 19:50:04 +03:00
};
GFS2: Add a "demote a glock" interface to sysfs
This adds a sysfs file called demote_rq to GFS2's
per filesystem directory. It's possible to use this
file to demote arbitrary glocks in exactly the same
way as if a request had come in from a remote node.
This is intended for testing issues relating to caching
of data under glocks. Despite that, the interface is
generic enough to send requests to any type of glock,
but be careful as it's not always safe to send an
arbitrary message to an arbitrary glock. For that reason
and to prevent DoS, this interface is restricted to root
only.
The messages look like this:
<type>:<glocknumber> <mode>
Example:
echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq
Which means "please demote inode glock (type 2) number 13324 so that
I can get an EX (exclusive) lock". The lock modes are those which
would normally be sent by a remote node in its callback so if you
want to unlock a glock, you use EX, to demote to shared, use SH or PR
(depending on whether you like GFS2 or DLM lock modes better!).
If the glock doesn't exist, you'll get -ENOENT returned. If the
arguments don't make sense, you'll get -EINVAL returned.
The plan is that this interface will be used in combination with
the blktrace patch which I recently posted for comments although
it is, of course, still useful in its own right.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
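The "<type>:<glocknumber> <mode>" message format described above can be parsed with a single sscanf() call. The sketch below is a userspace illustration, not the kernel's sysfs handler: struct demote_rq, enum demote_mode and parse_demote_rq are hypothetical names, and only the three mode strings mentioned in the message (EX, SH, PR) are handled.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result codes: the state the holder is asked to drop to. */
enum demote_mode { MODE_UNLOCKED, MODE_SHARED };

struct demote_rq {
	unsigned int type;		/* glock type, e.g. 2 for an inode glock */
	unsigned long long number;	/* glock number */
	enum demote_mode target;
};

/* Parse "<type>:<glocknumber> <mode>"; returns 0 or -EINVAL. */
static int parse_demote_rq(const char *buf, struct demote_rq *rq)
{
	char mode[16];

	if (sscanf(buf, "%u:%llu %15s", &rq->type, &rq->number, mode) != 3)
		return -EINVAL;

	/* A remote EX request means the local holder must unlock. */
	if (strcmp(mode, "EX") == 0)
		rq->target = MODE_UNLOCKED;
	/* SH (GFS2 name) and PR (DLM name) both ask for a demote to shared. */
	else if (strcmp(mode, "SH") == 0 || strcmp(mode, "PR") == 0)
		rq->target = MODE_SHARED;
	else
		return -EINVAL;

	return 0;
}
```

With the example from the message, parse_demote_rq("2:13324 EX", &rq) fills in type 2, number 13324 and an unlock target. Looking up the glock and returning -ENOENT when it does not exist are outside this sketch.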
2009-02-12 16:31:58 +03:00
const struct gfs2_glock_operations *gfs2_glops_list[] = {
	[LM_TYPE_META] = &gfs2_meta_glops,
	[LM_TYPE_INODE] = &gfs2_inode_glops,
	[LM_TYPE_RGRP] = &gfs2_rgrp_glops,
	[LM_TYPE_IOPEN] = &gfs2_iopen_glops,
	[LM_TYPE_FLOCK] = &gfs2_flock_glops,
	[LM_TYPE_NONDISK] = &gfs2_nondisk_glops,
	[LM_TYPE_QUOTA] = &gfs2_quota_glops,
	[LM_TYPE_JOURNAL] = &gfs2_journal_glops,
};
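A table like gfs2_glops_list relies on C designated initializers: each entry is placed at the index given by its type constant, and any index not named is left NULL. The sketch below shows that lookup with a bounds check. All names (T_*, struct sk_ops, sk_ops_list, sk_lookup_ops) are hypothetical, and the numeric values are illustrative apart from type 2 being the inode glock, as stated in the demote_rq message above.

```c
#include <stddef.h>

/* Illustrative type numbers for this sketch only. */
enum { T_NONDISK = 1, T_INODE = 2, T_RGRP = 3, T_MAX = 4 };

struct sk_ops {
	const char *name;
};

static const struct sk_ops nondisk_ops = { "nondisk" };
static const struct sk_ops inode_ops = { "inode" };
static const struct sk_ops rgrp_ops = { "rgrp" };

/* Entries are indexed by type; unnamed slots (here index 0) stay NULL. */
static const struct sk_ops *const sk_ops_list[T_MAX] = {
	[T_NONDISK] = &nondisk_ops,
	[T_INODE] = &inode_ops,
	[T_RGRP] = &rgrp_ops,
};

/* Returns the ops for a type, or NULL if the type is unknown. */
static const struct sk_ops *sk_lookup_ops(unsigned int type)
{
	if (type >= T_MAX)
		return NULL;
	return sk_ops_list[type];
}
```

A caller that accepts a type number from user input, such as a sysfs write, would need both checks: the range test and the NULL test for gaps in the table.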